ToxicCommons

Name: ToxicCommons
Creator: maas
Published: 2025-12-05 16:39:03
License: No description provided

ModelScope community. Updated 2025-12-05; indexed 2025-06-21.

Download link:

https://modelscope.cn/datasets/PleIAs/ToxicCommons

Resource description:

# Toxic Commons

Toxic Commons is a release of 2 million samples of annotated, public domain, multilingual text that was used to train [Celadon](https://huggingface.co/PleIAs/celadon). It is being released alongside Celadon, in order to better understand multilingual and multicultural toxicity.

Each sample was classified across 5 axes of toxicity:

* **Race and origin-based bias**: includes racism as well as bias against someone’s country or region of origin or immigration status, especially immigrant or refugee status.
* **Gender and sexuality-based bias**: includes sexism and misogyny, homophobia, transphobia, and sexual harassment.
* **Religious bias**: any bias or stereotype based on someone’s religion.
* **Ability bias**: bias according to someone’s physical, mental, or intellectual ability or disability.
* **Violence and abuse**: overly graphic descriptions of violence, threats of violence, or calls or incitement of violence.

All 2 million samples were classified by a version of Llama 3.1 8B Instruct, with a [custom system prompt](https://github.com/eliotjones1/celadon/blob/main/prompts/annotate.txt). To replicate the annotation process on your own dataset, feel free to refer to our script [here](https://github.com/eliotjones1/celadon/blob/main/src/2.1_create_annotations.py), and re-create the prompt for your use case.

Read more about the training details in the paper, [Toxicity of the Commons: Curating Open-Source Pre-Training Data](https://arxiv.org/pdf/2410.22587) by [Catherine Arnett](https://huggingface.co/catherinearnett), [Eliot Jones](https://huggingface.co/eliotj), [Ivan P. Yamshchikov](https://huggingface.co/ivan-the-bearable), [Pierre-Carl Langlais](https://huggingface.co/Pclanglais).

For more detailed code regarding generating the annotations, please refer to the official [GitHub](https://github.com/Pleias/toxic-commons) repository.

# How to Cite

```
@article{arnett2024toxicity,
  title={{Toxicity of the Commons: Curating Open-Source Pre-Training Data}},
  author={Arnett, Catherine and Jones, Eliot and Yamshchikov, Ivan P. and Langlais, Pierre-Carl},
  journal={arXiv preprint arXiv:2410.22587},
  url={https://arxiv.org/pdf/2410.22587},
  year={2024}
}
```

# About

Annotations were generated by [Eliot Jones](https://huggingface.co/eliotj) while working at [Pleias](https://huggingface.co/PleIAs). This project was made possible by Jean Zay compute grant #GC011015451.
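
# Example: working with per-axis annotations

The five toxicity axes described above could be represented and filtered roughly as follows. This is a minimal sketch, not the authors' code. The field names, the integer score scale, and the `max_total` cutoff are all assumptions made for illustration; the card does not state the column names or score range of the released data, so check the dataset itself before relying on them.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ToxicityScores:
    """One score per axis. Field names are hypothetical, not confirmed columns."""
    race_origin: int
    gender_sexuality: int
    religion: int
    ability: int
    violence_abuse: int

    def total(self) -> int:
        # Sum of the five per-axis scores.
        return sum(getattr(self, f.name) for f in fields(self))

    def flagged_axes(self, threshold: int = 1) -> list[str]:
        # Names of the axes whose score is at or above the threshold.
        return [f.name for f in fields(self) if getattr(self, f.name) >= threshold]


def filter_samples(samples, max_total):
    """Keep (text, scores) pairs whose summed score does not exceed max_total.

    `max_total` is an illustrative cutoff, not a value taken from the paper.
    """
    return [(text, scores) for text, scores in samples if scores.total() <= max_total]


if __name__ == "__main__":
    samples = [
        ("a neutral sentence", ToxicityScores(0, 0, 0, 0, 0)),
        ("a sentence flagged on two axes", ToxicityScores(0, 2, 0, 0, 1)),
    ]
    for text, scores in filter_samples(samples, max_total=1):
        print(text, scores.flagged_axes())
```

Keeping the axes as separate fields, rather than a single label, lets downstream code filter on one kind of bias without discarding text that is only flagged on another.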

Provider:

maas

Created:

2025-06-19
