
r1char9/toxic-detox-pairs

Hugging Face | Updated 2025-12-04 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/r1char9/toxic-detox-pairs
Resource description:
---
language:
- es
- ru
- de
- hi
- en
- ar
- zh
- uk
- am
task_categories:
- text2text-generation
- style-transfer
size_categories:
- 100K<n<1M
license: mit
---

# MultiParaDetox-9L: A Multilingual Parallel Dataset for Text Detoxification

## Dataset Description

**MultiParaDetox-9L** contains **109,985 parallel pairs** of toxic and human-rewritten neutral comments across **9 languages**: Spanish (es), Russian (ru), German (de), Hindi (hi), English (en), Arabic (ar), Chinese (zh), Ukrainian (uk), and Amharic (am).

### Key Features

* **Parallel Data**: Each entry is a `(toxic_comment, neutral_comment, lang)` triplet.
* **Multilingual Coverage**: 9 languages from diverse language families.
* **Human-Annotated**: Neutral versions created or validated by native speakers.

### Supported Tasks

* **Text Detoxification / Style Transfer**
* **Controlled Text Generation**
* **Multilingual NLP**

## Languages and Statistics

| Language | Code | Language Family | Exact Examples |
| :--- | :--- | :--- | :--- |
| Russian | `ru` | Slavic | 26,557 |
| English | `en` | Germanic | 19,228 |
| German | `de` | Germanic | 14,634 |
| Spanish | `es` | Romance | 14,494 |
| Ukrainian | `uk` | Slavic | 10,492 |
| Hindi | `hi` | Indo-Aryan | 9,447 |
| Chinese | `zh` | Sino-Tibetan | 6,290 |
| Arabic | `ar` | Semitic | 6,247 |
| Amharic | `am` | Semitic | 2,596 |

**Total Examples**: 109,985

## Dataset Structure

```python
{
    'toxic_comment': 'string',
    'neutral_comment': 'string',
    'lang': 'string'
}
```
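
The triplet schema and the per-language counts above can be exercised with a short sketch. The sample rows below are hypothetical placeholders, not real dataset entries, and the helper names (`is_valid`, `filter_by_lang`, `to_pair`) are illustrative assumptions rather than anything shipped with the dataset. Only the field names and the counts come from the card itself.

```python
from collections import Counter

# Per-language counts copied from the "Languages and Statistics" table.
LANG_COUNTS = {
    "ru": 26557, "en": 19228, "de": 14634, "es": 14494, "uk": 10492,
    "hi": 9447, "zh": 6290, "ar": 6247, "am": 2596,
}
TOTAL = sum(LANG_COUNTS.values())

# Field names taken from the "Dataset Structure" section.
REQUIRED_FIELDS = ("toxic_comment", "neutral_comment", "lang")

# Hypothetical placeholder rows in the documented schema (not real data).
rows = [
    {"toxic_comment": "<toxic text 1>", "neutral_comment": "<neutral text 1>", "lang": "en"},
    {"toxic_comment": "<toxic text 2>", "neutral_comment": "<neutral text 2>", "lang": "ru"},
    {"toxic_comment": "<toxic text 3>", "neutral_comment": "<neutral text 3>", "lang": "en"},
]


def is_valid(row):
    """Check that a row has all three fields as non-empty strings
    and that its language code is one of the nine listed codes."""
    has_fields = all(isinstance(row.get(k), str) and row[k] for k in REQUIRED_FIELDS)
    return has_fields and row["lang"] in LANG_COUNTS


def filter_by_lang(records, lang):
    """Keep only the rows whose `lang` field equals the given code."""
    return [r for r in records if r["lang"] == lang]


def to_pair(row):
    """Turn a row into a (source, target) pair for a detoxification model:
    the toxic comment is the input, the neutral rewrite is the target."""
    return row["toxic_comment"], row["neutral_comment"]


# Count the sample rows per language and build training pairs for English.
counts = Counter(r["lang"] for r in rows if is_valid(r))
english_pairs = [to_pair(r) for r in filter_by_lang(rows, "en")]

# Share of each language in the full dataset, in percent.
shares = {code: round(100 * n / TOTAL, 1) for code, n in LANG_COUNTS.items()}
```

With the Hugging Face `datasets` library, `load_dataset("r1char9/toxic-detox-pairs")` would presumably yield rows in this shape. The card does not state the split names, so check those before relying on them.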
Provider:
r1char9