
entfane/toxic_chat

Source: Hugging Face. Updated 2026-03-01; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/entfane/toxic_chat
Resource description:
---
task_categories:
- text-classification
language:
- en
size_categories:
- 100K<n<1M
---

# Dataset Description

This dataset is a filtered and restructured collection of toxic and sensitive content, curated for training safety classifiers and content moderation models. It combines synthetic data from ToxiGen with real-world samples from ToxicDataset, downsampled and cleaned for high-density training.

## Data Sources

* **[ToxiGen (Train Split)](https://huggingface.co/datasets/toxigen/toxigen-data):** A large-scale synthetic dataset of toxic and benign statements about 13 minority groups.
* **[AiActivity/ToxicDataset](https://huggingface.co/datasets/AiActivity/ToxicDataset) (10% Sample):** A diverse collection of toxic comments, randomly downsampled to 10% to balance the distribution.

## Preprocessing & Cleaning

To make the data "machine-ready" and consistent across sources, the following transformations were applied:

* **Prefix Removal:** Stripped leading bullet points (- ) from the beginning of every sentence to normalize text structure.
* **Escape Sequence Correction:** Converted literal \\n- string sequences into standard newline characters and removed redundant leading dashes.
* **Column Pruning:** Removed all auxiliary metadata columns, retaining only the core text and label columns (the last 3 columns of the original merged set).
* **Sampling & Merging:**
  * Full train split of ToxiGen.
  * 0.1 (10%) random sample of AiActivity/ToxicDataset.

## Shuffling

The final combined dataset was globally shuffled with a fixed seed (42) so that toxicity types are distributed across batches.

## Intended Use

* **Primary Use:** Training and fine-tuning LLM safety guardrails.
* **Secondary Use:** Benchmarking moderation APIs against historical and synthetic tropes.

> [!CAUTION]
> Warning: This dataset contains highly offensive, biased, and dehumanizing language. It is intended strictly for research and development of safety tools. Use with professional discretion.

## Dataset Schema

The final dataset contains the following 3 columns:

| Column Name | Type | Description |
| :--- | :--- | :--- |
| input | string | The cleaned input |
| output | string | Output |
| label | int | Toxicity classification (0: Neutral, 1: Toxic). |
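
The steps listed under Preprocessing & Cleaning and Shuffling can be sketched roughly as follows. This is a minimal illustration in plain Python, not the author's actual pipeline, which the card does not show. The helper names `clean_text` and `build_dataset` are hypothetical, the exact raw formats (a leading `- ` marker, a literal backslash-`n` followed by a dash) are assumptions read from the description, and columns are selected by name here rather than by position. Applying the cleaning only to the `input` column is also an assumption.

```python
import random


def clean_text(text: str) -> str:
    """Normalize one raw sample (hypothetical helper, assumed raw formats)."""
    # Escape sequence correction: replace the literal characters
    # backslash + "n" + "-" (optionally followed by a space) with a real
    # newline, which also drops the redundant dash.
    text = text.replace("\\n- ", "\n").replace("\\n-", "\n")
    # Prefix removal: strip a single leading "- " bullet marker.
    if text.startswith("- "):
        text = text[2:]
    return text.strip()


def build_dataset(toxigen_rows, toxic_rows, fraction=0.1, seed=42):
    """Merge, prune, and shuffle two lists of row dicts (illustrative only)."""
    rng = random.Random(seed)  # fixed seed, as stated in the card
    # Sampling: keep a random 10% of the second source.
    sample_size = round(len(toxic_rows) * fraction)
    sampled = rng.sample(list(toxic_rows), sample_size)
    # Merging + column pruning: keep only the three schema columns.
    merged = [
        {
            "input": clean_text(row["input"]),
            "output": row["output"],
            "label": row["label"],
        }
        for row in list(toxigen_rows) + sampled
    ]
    # Shuffling: one global shuffle over the combined rows.
    rng.shuffle(merged)
    return merged


if __name__ == "__main__":
    # Tiny made-up rows, only to show the call shape.
    toxigen = [{"input": "- first\\n- second", "output": "x", "label": 1, "meta": 0}]
    other = [{"input": f"- text {i}", "output": "y", "label": 0} for i in range(20)]
    for row in build_dataset(toxigen, other):
        print(row)
```

Using a dedicated `random.Random(seed)` instance keeps the sample and the shuffle reproducible without touching the global random state.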
Provider:
entfane