
ToxicCommons

ModelScope community. Updated 2025-12-05. Indexed 2025-06-21.
Download link:
https://modelscope.cn/datasets/PleIAs/ToxicCommons
Resource overview:
# Toxic Commons

Toxic Commons is a release of 2 million samples of annotated, public domain, multilingual text that was used to train [Celadon](https://huggingface.co/PleIAs/celadon). It is being released alongside Celadon, in order to better understand multilingual and multicultural toxicity.

Each sample was classified across 5 axes of toxicity:

* **Race and origin-based bias**: includes racism as well as bias against someone’s country or region of origin or immigration status, especially immigrant or refugee status.
* **Gender and sexuality-based bias**: includes sexism and misogyny, homophobia, transphobia, and sexual harassment.
* **Religious bias**: any bias or stereotype based on someone’s religion.
* **Ability bias**: bias according to someone’s physical, mental, or intellectual ability or disability.
* **Violence and abuse**: overly graphic descriptions of violence, threats of violence, or calls or incitement of violence.

All 2 million samples were classified by a version of Llama 3.1 8B Instruct, with a [custom system prompt](https://github.com/eliotjones1/celadon/blob/main/prompts/annotate.txt). To replicate the annotation process on your own dataset, feel free to refer to our script [here](https://github.com/eliotjones1/celadon/blob/main/src/2.1_create_annotations.py), and re-create the prompt for your use case.

Read more about the training details in the paper, [Toxicity of the Commons: Curating Open-Source Pre-Training Data](https://arxiv.org/pdf/2410.22587) by [Catherine Arnett](https://huggingface.co/catherinearnett), [Eliot Jones](https://huggingface.co/eliotj), [Ivan P. Yamshchikov](https://huggingface.co/ivan-the-bearable), [Pierre-Carl Langlais](https://huggingface.co/Pclanglais).

For more detailed code regarding generating the annotations, please refer to the official [GitHub](https://github.com/Pleias/toxic-commons) repository.
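As a rough illustration of working with the five axes described above, the sketch below models one annotation as a mapping from axis to an integer score and reports which axes meet a caller-chosen threshold. The key names, the integer scores, and the `flagged_axes` helper are all assumptions made for this example; they are not taken from the dataset's actual schema, so check the dataset files for the real column names and score range before adapting it.

```python
from enum import Enum
from typing import Dict, List


class ToxicityAxis(Enum):
    """The five axes listed in this README. The string keys are hypothetical."""
    RACE_ORIGIN = "race_origin"
    GENDER_SEXUALITY = "gender_sexuality"
    RELIGION = "religion"
    ABILITY = "ability"
    VIOLENCE_ABUSE = "violence_abuse"


def flagged_axes(scores: Dict[str, int], threshold: int) -> List[ToxicityAxis]:
    """Return the axes whose score is at or above the caller-chosen threshold.

    `scores` is an assumed per-sample annotation keyed by the hypothetical
    axis names above. A ValueError is raised if any axis is missing.
    """
    missing = [axis.value for axis in ToxicityAxis if axis.value not in scores]
    if missing:
        raise ValueError(f"annotation is missing axes: {missing}")
    return [axis for axis in ToxicityAxis if scores[axis.value] >= threshold]


# Made-up annotation for a single sample, for illustration only.
example = {
    "race_origin": 0,
    "gender_sexuality": 2,
    "religion": 0,
    "ability": 0,
    "violence_abuse": 1,
}

print([axis.name for axis in flagged_axes(example, threshold=1)])
```

Keeping the axes in an `Enum` makes a missing or misspelled axis fail loudly instead of being silently skipped, which is useful when filtering a large corpus.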
# How to Cite

```
@article{arnett2024toxicity,
  title={{Toxicity of the Commons: Curating Open-Source Pre-Training Data}},
  author={Arnett, Catherine and Jones, Eliot and Yamshchikov, Ivan P. and Langlais, Pierre-Carl},
  journal={arXiv preprint arXiv:2410.22587},
  url={https://arxiv.org/pdf/2410.22587},
  year={2024}
}
```

# About

Annotations were generated by [Eliot Jones](https://huggingface.co/eliotj) while working at [Pleias](https://huggingface.co/PleIAs). This project was made possible by Jean Zay compute grant #GC011015451.

Provider:
maas
Created:
2025-06-19