On-device Toxicity Filter for Community Platform

Client: A technology company focused on enabling real-time moderation on client devices for messaging platforms.

Objective:

Develop a lightweight, real-time toxicity filter capable of operating on resource-constrained client devices such as laptops and phones. The model must classify messages into multiple toxicity levels (e.g., toxic, obscene, identity hate) while considering contextual nuances, such as self-referential comments. The system should leverage the Candle framework, support cross-platform compatibility (Windows, Mac, iOS, Android), and integrate seamlessly into client applications.

Challenges:

  1. Real-Time Processing: Ensure the model runs locally with minimal latency, ruling out large-scale models like GPT-4 for direct inference.
  2. Contextual Sensitivity: Address nuanced toxicity detection, such as distinguishing self-deprecating comments from general toxicity.
  3. Cross-Platform Deployment: Build for various operating systems with compatibility for native environments and potential GPU inference.
  4. Continuous Improvement: Fine-tune the model to prevent "forgetting" previously learned toxic word associations while incorporating new training data for context handling.

Approach:

  1. Data Collection and Augmentation:
    • Utilized existing datasets, including Kaggle's Jigsaw Toxicity Classification and Google's Contextual Toxicity Dataset, supplemented with 300+ custom examples focusing on self-references and nuanced cases.
    • Suggested using GPT-4 Turbo to classify large datasets (e.g., Discord messages) for additional training data, particularly for areas where the model and GPT-4 predictions disagree or lack confidence.
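The disagreement/low-confidence sampling suggested above can be sketched as a simple selection function. This is an illustrative sketch, not the client's actual pipeline: the function names, data shapes, and the 0.7 confidence threshold are assumptions.

```python
# Hypothetical sketch of disagreement/low-confidence sampling for
# harvesting new training examples from a large unlabeled corpus.

def select_for_labeling(messages, local_preds, gpt4_preds, confidence_threshold=0.7):
    """Pick messages worth adding to the training set.

    local_preds: list of (label, confidence) pairs from the on-device model.
    gpt4_preds:  list of labels assigned by GPT-4 Turbo for the same messages.

    A message is kept when the two models disagree, or when the local
    model's confidence falls below the threshold; the GPT-4 label is
    used as the provisional training target.
    """
    selected = []
    for msg, (local_label, conf), gpt4_label in zip(messages, local_preds, gpt4_preds):
        if local_label != gpt4_label or conf < confidence_threshold:
            selected.append((msg, gpt4_label))
    return selected
```

Messages where both models agree with high confidence add little signal, so filtering them out keeps the hand-review and fine-tuning workload focused on the hard cases.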
  2. Model Training and Fine-Tuning:
    • Developed and trained the initial model in Python using PyTorch, achieving better performance than Google's baseline models.
    • Transitioned the model to ONNX format for deployment and optimized vocabulary size to address flagged issues, such as handling rare toxic terms and contractions.
    • Addressed forgetting during fine-tuning by reintroducing random samples from the original training set to retain base toxic word recognition.
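The rehearsal strategy in the last bullet, reintroducing random samples from the original training set during fine-tuning, can be sketched as follows. The function name and the 30% rehearsal ratio are illustrative assumptions; the actual ratio would be tuned against the validation sets.

```python
import random

# Illustrative rehearsal sampling to mitigate catastrophic forgetting:
# mix a random slice of the original training data into each
# fine-tuning set so base toxic-word associations are retained.

def build_finetune_set(new_examples, original_examples, rehearsal_ratio=0.3, seed=0):
    """Return new_examples plus a random sample of original_examples.

    rehearsal_ratio controls how many original examples are replayed
    relative to the number of new (e.g. self-referential) examples.
    """
    rng = random.Random(seed)
    n_rehearsal = int(len(new_examples) * rehearsal_ratio)
    rehearsal = rng.sample(original_examples, min(n_rehearsal, len(original_examples)))
    mixed = list(new_examples) + rehearsal
    rng.shuffle(mixed)
    return mixed
```

A fixed seed keeps the mix reproducible across training runs, which makes before/after comparisons on the test sets meaningful.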
  3. Platform-Specific Deployments:
    • Created platform-specific binaries for Windows, Mac (M2), and Linux, using the Tract library for Rust-based inference.
    • Delivered a lightweight command-line executable, optimized for low latency, achieving near-instant response times.
    • Implemented a vocabulary pruning strategy to balance model size (currently ~900 MB uncompressed) and performance.
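The vocabulary pruning mentioned above amounts to keeping the most frequent tokens (plus protected entries such as special tokens and known toxic terms) and slicing the embedding matrix to match. The sketch below is a simplified, assumed version of that idea using plain Python structures rather than the real tokenizer and tensors.

```python
# Hypothetical vocabulary pruning sketch: shrink the embedding table to
# the top-frequency tokens while always retaining protected tokens
# (special tokens and rare-but-important toxic terms).

def prune_vocab(vocab, counts, embeddings, keep_size, protected=()):
    """vocab: token -> old id; counts: token -> corpus frequency;
    embeddings: list of vectors indexed by old id.

    Returns (new_vocab, new_embeddings). Protected tokens are kept even
    if infrequent, so the final size may slightly exceed keep_size.
    """
    keep = set(protected)
    for tok, _ in sorted(counts.items(), key=lambda kv: -kv[1]):
        if len(keep) >= keep_size:
            break
        keep.add(tok)
    new_vocab, new_embeddings = {}, []
    for tok in vocab:  # preserve original id order for determinism
        if tok in keep:
            new_vocab[tok] = len(new_embeddings)
            new_embeddings.append(embeddings[vocab[tok]])
    return new_vocab, new_embeddings
```

Since the embedding table dominates the ~900 MB footprint, cutting the vocabulary is the most direct lever on model size, at the cost of routing dropped tokens through the unknown-token path.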
  4. Testing and Iterative Improvements:
    • Validated the model using three separate test sets to ensure robust performance across different use cases.
    • Iteratively refined based on client feedback, resolving issues with self-referencing and other nuanced examples while maintaining recognition of obvious toxicity indicators.
    • Proposed incorporating flagged community messages as labeled examples for continuous model improvement.
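Validating against three separate test sets, as described above, is most useful when results are reported per set, so a regression on one use case (e.g. self-referential comments) is not masked by a pooled average. A minimal harness, with assumed names and data shapes, might look like:

```python
# Minimal per-test-set evaluation harness (illustrative shapes).

def evaluate_across_sets(predict, test_sets):
    """predict: callable message -> predicted label.
    test_sets: dict of set name -> list of (message, expected_label).

    Returns a per-set accuracy report rather than a single pooled score,
    so each use case can be tracked independently across iterations.
    """
    report = {}
    for name, examples in test_sets.items():
        correct = sum(predict(msg) == label for msg, label in examples)
        report[name] = correct / len(examples) if examples else 0.0
    return report
```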
  5. Future Enhancements:
    • Explored GPU inference options using the Ort crate, pending stable support for iOS and Android.
    • Suggested potential integration of image toxicity detection, recommending lightweight models like the nsfw crate or pre-trained external APIs, with plans for fine-tuning as datasets grow.

Results:

  • Delivered a working MVP with binaries for all major platforms.
  • Achieved nuanced classification, resolving issues with self-referential and long-form comments while maintaining acceptable performance on traditional toxicity markers.
  • Improved real-time performance, achieving near-instant inference with a model size manageable for client-side deployment.
  • Provided actionable pathways for future data-driven improvements, including user feedback and targeted fine-tuning.

Technologies and Tools:

  • Python (PyTorch, ONNX) for model development.
  • Rust (Tract, Candle) for cross-platform, low-latency inference.
  • Datasets: Jigsaw Unintended Bias, Google Contextual Toxicity, GPT-4 Turbo classifications.

Client Impact:

The client appreciated the iterative approach and the significant progress toward a robust toxicity classification system. Project highlights include cross-platform compatibility, high efficiency, and improved contextual handling, providing a strong foundation for the next phase of development in January.