textdetox/multilingual_toxicity_dataset

Name: textdetox/multilingual_toxicity_dataset
Creator: textdetox
Published: 2025-03-21 18:52:31
License: No description available

Source: Hugging Face. Updated 2025-03-21. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/textdetox/multilingual_toxicity_dataset

Overview:

This is a multilingual binary toxicity classification dataset prepared for the CLEF TextDetox 2024 shared task. Each language has 5,000 samples: 2,500 toxic and 2,500 non-toxic. The data is drawn from several publicly available toxic-comment datasets, including Jigsaw, the Unitary AI Toxicity Dataset, and others. Supported languages are English, Russian, Ukrainian, German, Spanish, Amharic, Chinese, Arabic, and Hindi.

Provider:

textdetox

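Summary of original information

Dataset overview

Task type: binary toxicity classification
Dataset composition:
- A 5,000-sample subset is provided for each language
- Each subset contains 2,500 toxic samples and 2,500 non-toxic samples
Intended use: CLEF TextDetox 2024 shared task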
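
The composition described above (5,000 samples per language, split evenly between the two classes) can be checked with a short script. The following is a minimal sketch, not an official loader. It assumes that each language is exposed as a split named by its language code, and that the label column is called `toxic`. Neither detail is stated in this listing, so adjust both to match the actual dataset. The helper names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping

# ASSUMPTION: split names are language codes for the nine listed languages.
ASSUMED_LANGS = ["en", "ru", "uk", "de", "es", "am", "zh", "ar", "hi"]


def label_counts(rows: Iterable[Mapping], label_key: str = "toxic") -> Counter:
    """Count how many rows carry each label value."""
    return Counter(row[label_key] for row in rows)


def is_balanced(rows: Iterable[Mapping], label_key: str = "toxic",
                per_class: int = 2500) -> bool:
    """True if exactly two label values occur, each `per_class` times."""
    counts = label_counts(rows, label_key)
    return len(counts) == 2 and all(n == per_class for n in counts.values())


def load_language(lang: str):
    """Load one language subset from the Hugging Face Hub.

    ASSUMPTION: the subset is a split named by its language code.
    Requires `pip install datasets` and network access.
    """
    from datasets import load_dataset
    return load_dataset("textdetox/multilingual_toxicity_dataset", split=lang)


# Example usage (downloads data, so it is left commented out):
# for lang in ASSUMED_LANGS:
#     print(lang, is_balanced(load_language(lang)))
```

Keeping the counting logic separate from the download step means the balance check can be run on any iterable of row dictionaries, including a local copy of the data.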
