r1char9/toxic-detox-pairs
Source: Hugging Face. Updated 2025-12-04. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/r1char9/toxic-detox-pairs
Resource description:
---
language:
- es
- ru
- de
- hi
- en
- ar
- zh
- uk
- am
task_categories:
- text2text-generation
- style-transfer
size_categories:
- 100K<n<1M
license: mit
---
# MultiParaDetox-9L: A Multilingual Parallel Dataset for Text Detoxification
## Dataset Description
**MultiParaDetox-9L** contains **109,985 parallel pairs** of toxic and human-rewritten neutral comments across **9 languages**: Spanish (es), Russian (ru), German (de), Hindi (hi), English (en), Arabic (ar), Chinese (zh), Ukrainian (uk), and Amharic (am).
### Key Features
* **Parallel Data**: Each entry is a `(toxic_comment, neutral_comment, lang)` triplet.
* **Multilingual Coverage**: 9 languages from diverse language families.
* **Human-Annotated**: Neutral versions created or validated by native speakers.
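
The triplet format above can be sketched as a small typed record in Python. This is a minimal illustration, not the author's tooling: `DetoxPair`, `to_pair`, and `load_pairs` are hypothetical names, and `load_pairs` assumes the repository loads with the Hugging Face `datasets` library and exposes a `train` split, which the card does not confirm.

```python
from typing import NamedTuple

# Language codes listed in the dataset card.
LANGS = {"es", "ru", "de", "hi", "en", "ar", "zh", "uk", "am"}


class DetoxPair(NamedTuple):
    """One (toxic_comment, neutral_comment, lang) triplet."""
    toxic_comment: str
    neutral_comment: str
    lang: str


def to_pair(row: dict) -> DetoxPair:
    """Convert one raw row into a DetoxPair, rejecting unknown language codes."""
    pair = DetoxPair(row["toxic_comment"], row["neutral_comment"], row["lang"])
    if pair.lang not in LANGS:
        raise ValueError(f"unexpected language code: {pair.lang!r}")
    return pair


def load_pairs(split: str = "train") -> list:
    """Load and validate all rows.

    Assumption: the repo is loadable with `datasets` and has a `train` split.
    """
    from datasets import load_dataset  # third-party; imported lazily

    rows = load_dataset("r1char9/toxic-detox-pairs", split=split)
    return [to_pair(row) for row in rows]
```

Validating the language code at load time is one way to catch malformed rows early; whether the published data needs it is not stated in the card.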
### Supported Tasks
* **Text Detoxification / Style Transfer**
* **Controlled Text Generation**
* **Multilingual NLP**
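
For the detoxification task, each row maps naturally onto a source/target pair for a sequence-to-sequence model. The sketch below shows one possible mapping; the function name `to_text2text` and the `detoxify <lang>: ` prompt prefix are hypothetical conventions chosen for illustration, not something the dataset prescribes.

```python
def to_text2text(row: dict) -> dict:
    """Map one dataset row to a source/target pair for a seq2seq model.

    The "detoxify <lang>: " prefix is a hypothetical prompt convention.
    """
    return {
        "source": f"detoxify {row['lang']}: {row['toxic_comment']}",
        "target": row["neutral_comment"],
    }


# Placeholder row using the card's field names (not real dataset content).
example = {
    "toxic_comment": "<toxic text>",
    "neutral_comment": "<neutral text>",
    "lang": "en",
}
formatted = to_text2text(example)
```

Including the language code in the source string is one simple way to condition a single multilingual model on the target language.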
## Languages and Statistics
| Language | Code | Language Family or Branch | Examples |
| :--- | :--- | :--- | :--- |
| Russian | `ru` | Slavic | 26,557 |
| English | `en` | Germanic | 19,228 |
| German | `de` | Germanic | 14,634 |
| Spanish | `es` | Romance | 14,494 |
| Ukrainian | `uk` | Slavic | 10,492 |
| Hindi | `hi` | Indo-Aryan | 9,447 |
| Chinese | `zh` | Sino-Tibetan | 6,290 |
| Arabic | `ar` | Semitic | 6,247 |
| Amharic | `am` | Semitic | 2,596 |
**Total Examples**: 109,985
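
The per-language counts in the table can be checked against the stated total, and they also show how unbalanced the languages are: Russian accounts for roughly 24% of all pairs and Amharic for roughly 2%. The following sketch hard-codes the counts from the table; `COUNTS`, `TOTAL`, and `share` are illustrative names.

```python
# Per-language pair counts copied from the table in this card.
COUNTS = {
    "ru": 26557,
    "en": 19228,
    "de": 14634,
    "es": 14494,
    "uk": 10492,
    "hi": 9447,
    "zh": 6290,
    "ar": 6247,
    "am": 2596,
}

# Sum of all per-language counts; should match the card's stated total.
TOTAL = sum(COUNTS.values())


def share(lang: str) -> float:
    """Fraction of all pairs that belong to `lang`."""
    return COUNTS[lang] / TOTAL


# Ratio between the largest and smallest language, a rough imbalance measure.
imbalance = max(COUNTS.values()) / min(COUNTS.values())
```

An imbalance of about 10 to 1 between the largest and smallest language suggests that per-language evaluation, rather than a single pooled score, gives a clearer picture of model quality.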
## Dataset Structure
```python
{
    'toxic_comment': 'string',
    'neutral_comment': 'string',
    'lang': 'string'
}
```
Provider: r1char9



