PleIAs/ToxicCommons
Source: Hugging Face. Updated: 2024-11-03. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/PleIAs/ToxicCommons
Description:
Toxic Commons is a multilingual dataset of 2 million annotated public-domain texts, used to train the Celadon model. It aims to support a better understanding of toxicity in multilingual and multicultural contexts. Each sample is classified along five axes of toxicity: race and origin-based bias, gender and sexuality-based bias, religious bias, ability bias, and violence and abuse. All samples were classified by the Llama 3.1 8B Instruct model using a custom system prompt, and the scripts and prompts used to generate the annotations are provided. Details on the dataset and the annotation process are in the related paper and GitHub repository.
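For readers who want to inspect the records, the following is a minimal sketch of streaming a few samples with the Hugging Face `datasets` library. The dataset ID is taken from the listing title. The `train` split name and the helper `peek` are assumptions made for illustration, and no column names are assumed because the listing does not state them.

```python
from itertools import islice

# The five toxicity axes named in the description above.
TOXICITY_AXES = (
    "race and origin-based bias",
    "gender and sexuality-based bias",
    "religious bias",
    "ability bias",
    "violence and abuse",
)

# Dataset ID as given in the listing title.
DATASET_ID = "PleIAs/ToxicCommons"


def peek(n=3, split="train"):
    """Return the first n records, streamed rather than fully downloaded.

    `peek` is a hypothetical helper, and the "train" split name is an
    assumption; check the dataset card for the actual split names.
    """
    # Imported lazily so this module loads even without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(DATASET_ID, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `peek(1)` requires network access and the `datasets` package. Inspecting the keys of the returned record is a quick way to find the actual column names before writing any filtering code.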
Provider:
PleIAs



