On-device Toxicity Filter for Community Platform

Client: A technology company focused on enabling real-time moderation on client devices for messaging platforms.

Objective:

Develop a lightweight, real-time toxicity filter capable of operating on resource-constrained client devices such as laptops and phones. The model must classify messages into multiple toxicity levels (e.g., toxic, obscene, identity hate) while considering contextual nuances, such as self-referential comments. The system should leverage the Candle framework, support cross-platform compatibility (Windows, Mac, iOS, Android), and integrate seamlessly into client applications.

Challenges:

  1. Real-Time Processing: Ensure the model runs locally with minimal latency, ruling out large-scale models like GPT-4 for direct inference.
  2. Contextual Sensitivity: Address nuanced toxicity detection, such as distinguishing self-deprecating comments from general toxicity.
  3. Cross-Platform Deployment: Build for various operating systems with compatibility for native environments and potential GPU inference.
  4. Continuous Improvement: Fine-tune the model to prevent "forgetting" previously learned toxic word associations while incorporating new training data for context handling.

Approach:

  1. Data Collection and Augmentation:
    • Utilized existing datasets, including Kaggle's Jigsaw Toxicity Classification and Google's Contextual Toxicity Dataset, supplemented with 300+ custom examples focusing on self-references and nuanced cases.
    • Suggested using GPT-4 Turbo to classify large datasets (e.g., Discord messages) for additional training data, particularly for areas where the model and GPT-4 predictions disagree or lack confidence.
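The disagreement/low-confidence sampling suggested above can be sketched as a simple selection function. This is an illustrative sketch, not the client's actual pipeline: the function names, data shapes, and the 0.7 confidence threshold are assumptions.

```python
# Hypothetical sketch of disagreement/low-confidence sampling for
# harvesting new training examples from a large unlabeled corpus.

def select_for_labeling(messages, local_preds, gpt4_preds, confidence_threshold=0.7):
    """Pick messages worth adding to the training set.

    local_preds: list of (label, confidence) pairs from the on-device model.
    gpt4_preds:  list of labels assigned by GPT-4 Turbo for the same messages.

    A message is kept when the two models disagree, or when the local
    model's confidence falls below the threshold; the GPT-4 label is
    used as the provisional training target.
    """
    selected = []
    for msg, (local_label, conf), gpt4_label in zip(messages, local_preds, gpt4_preds):
        if local_label != gpt4_label or conf < confidence_threshold:
            selected.append((msg, gpt4_label))
    return selected
```

Messages where both models agree with high confidence add little signal, so filtering them out keeps the hand-review and fine-tuning workload focused on the hard cases.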
  2. Model Training and Fine-Tuning:
    • Developed and trained the initial model in Python using PyTorch, achieving better performance than Google's baseline models.
    • Transitioned the model to ONNX format for deployment and optimized vocabulary size to address flagged issues, such as handling rare toxic terms and contractions.
    • Addressed forgetting during fine-tuning by reintroducing random samples from the original training set to retain base toxic word recognition.
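The rehearsal strategy in the last bullet, reintroducing random samples from the original training set during fine-tuning, can be sketched as follows. The function name and the 30% rehearsal ratio are illustrative assumptions; the actual ratio would be tuned against the validation sets.

```python
import random

# Illustrative rehearsal sampling to mitigate catastrophic forgetting:
# mix a random slice of the original training data into each
# fine-tuning set so base toxic-word associations are retained.

def build_finetune_set(new_examples, original_examples, rehearsal_ratio=0.3, seed=0):
    """Return new_examples plus a random sample of original_examples.

    rehearsal_ratio controls how many original examples are replayed
    relative to the number of new (e.g. self-referential) examples.
    """
    rng = random.Random(seed)
    n_rehearsal = int(len(new_examples) * rehearsal_ratio)
    rehearsal = rng.sample(original_examples, min(n_rehearsal, len(original_examples)))
    mixed = list(new_examples) + rehearsal
    rng.shuffle(mixed)
    return mixed
```

A fixed seed keeps the mix reproducible across training runs, which makes before/after comparisons on the test sets meaningful.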
  3. Platform-Specific Deployments:
    • Created platform-specific binaries for Windows, Mac (M2), and Linux, using the Tract library for Rust-based inference.
    • Delivered a lightweight command-line executable, optimized for low latency, achieving near-instant response times.
    • Implemented a vocabulary pruning strategy to balance model size (currently ~900 MB uncompressed) and performance.
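The vocabulary pruning mentioned above amounts to keeping the most frequent tokens (plus protected entries such as special tokens and known toxic terms) and slicing the embedding matrix to match. The sketch below is a simplified, assumed version of that idea using plain Python structures rather than the real tokenizer and tensors.

```python
# Hypothetical vocabulary pruning sketch: shrink the embedding table to
# the top-frequency tokens while always retaining protected tokens
# (special tokens and rare-but-important toxic terms).

def prune_vocab(vocab, counts, embeddings, keep_size, protected=()):
    """vocab: token -> old id; counts: token -> corpus frequency;
    embeddings: list of vectors indexed by old id.

    Returns (new_vocab, new_embeddings). Protected tokens are kept even
    if infrequent, so the final size may slightly exceed keep_size.
    """
    keep = set(protected)
    for tok, _ in sorted(counts.items(), key=lambda kv: -kv[1]):
        if len(keep) >= keep_size:
            break
        keep.add(tok)
    new_vocab, new_embeddings = {}, []
    for tok in vocab:  # preserve original id order for determinism
        if tok in keep:
            new_vocab[tok] = len(new_embeddings)
            new_embeddings.append(embeddings[vocab[tok]])
    return new_vocab, new_embeddings
```

Since the embedding table dominates the ~900 MB footprint, cutting the vocabulary is the most direct lever on model size, at the cost of routing dropped tokens through the unknown-token path.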
  4. Testing and Iterative Improvements:
    • Validated the model using three separate test sets to ensure robust performance across different use cases.
    • Iteratively refined based on client feedback, resolving issues with self-referencing and other nuanced examples while maintaining recognition of obvious toxicity indicators.
    • Proposed incorporating flagged community messages as labeled examples for continuous model improvement.
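Validating against three separate test sets, as described above, is most useful when results are reported per set, so a regression on one use case (e.g. self-referential comments) is not masked by a pooled average. A minimal harness, with assumed names and data shapes, might look like:

```python
# Minimal per-test-set evaluation harness (illustrative shapes).

def evaluate_across_sets(predict, test_sets):
    """predict: callable message -> predicted label.
    test_sets: dict of set name -> list of (message, expected_label).

    Returns a per-set accuracy report rather than a single pooled score,
    so each use case can be tracked independently across iterations.
    """
    report = {}
    for name, examples in test_sets.items():
        correct = sum(predict(msg) == label for msg, label in examples)
        report[name] = correct / len(examples) if examples else 0.0
    return report
```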
  5. Future Enhancements:
    • Explored GPU inference options using the Ort crate, pending stable support for iOS and Android.
    • Suggested potential integration of image toxicity detection, recommending lightweight models like the nsfw crate or pre-trained external APIs, with plans for fine-tuning as datasets grow.

Results:

  • Delivered a working MVP with binaries for all major platforms.
  • Achieved nuanced classification, resolving issues with self-referential and long-form comments while maintaining acceptable performance on traditional toxicity markers.
  • Improved real-time performance, achieving near-instant inference with a model size manageable for client-side deployment.
  • Provided actionable pathways for future data-driven improvements, including user feedback and targeted fine-tuning.

Technologies and Tools:

  • Python (PyTorch, ONNX) for model development.
  • Rust (Tract, Candle) for cross-platform, low-latency inference.
  • Datasets: Jigsaw Unintended Bias, Google Contextual Toxicity, GPT-4 Turbo classifications.

Client Impact:

The client appreciated the iterative approach and the significant progress toward a robust toxicity classification system. Project highlights include cross-platform compatibility, high efficiency, and improved contextual handling, providing a strong foundation for the next phase of development in January.