civil_comments

Name: civil_comments
Creator: maas
Published: 2025-12-10 16:30:52
License: No description provided

ModelScope community; updated 2025-12-10; indexed 2025-04-26

Download link:

https://modelscope.cn/datasets/google/civil_comments

Resource description:

# Dataset Card for "civil_comments"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data)
- **Repository:** https://github.com/conversationai/unintended-ml-bias-analysis
- **Paper:** https://arxiv.org/abs/1903.04561
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 414.95 MB
- **Size of the generated dataset:** 661.23 MB
- **Total amount of disk used:** 1.08 GB

### Dataset Summary

The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world.
When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity and identity mentions. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### default

- **Size of downloaded dataset files:** 414.95 MB
- **Size of the generated dataset:** 661.23 MB
- **Total amount of disk used:** 1.08 GB

An example of 'validation' looks as follows.

```
{
    "identity_attack": 0.0,
    "insult": 0.0,
    "obscene": 0.0,
    "severe_toxicity": 0.0,
    "sexual_explicit": 0.0,
    "text": "The public test.",
    "threat": 0.0,
    "toxicity": 0.0
}
```

### Data Fields

The data fields are the same among all splits.

#### default

- `text`: a `string` feature.
- `toxicity`: a `float32` feature.
- `severe_toxicity`: a `float32` feature.
- `obscene`: a `float32` feature.
- `threat`: a `float32` feature.
- `insult`: a `float32` feature.
- `identity_attack`: a `float32` feature.
- `sexual_explicit`: a `float32` feature.
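
The seven label fields listed above hold float scores rather than hard class labels, so a consumer usually has to pick a cut-off before training a classifier. The following is a minimal sketch of that step, assuming the scores lie in [0, 1] and that 0.5 is an acceptable threshold. The threshold, the `LABEL_FIELDS` tuple and the `binarize` helper are illustrative names and choices, not something this card defines.

```python
from typing import Dict, Mapping, Union

# Label columns as listed in the Data Fields section.
LABEL_FIELDS = (
    "toxicity",
    "severe_toxicity",
    "obscene",
    "threat",
    "insult",
    "identity_attack",
    "sexual_explicit",
)

Record = Mapping[str, Union[str, float]]


def binarize(record: Record, threshold: float = 0.5) -> Dict[str, bool]:
    """Map each float score to True when it is >= threshold.

    The 0.5 default is an assumption made for illustration; the card
    itself does not state a threshold.
    """
    missing = [name for name in LABEL_FIELDS if name not in record]
    if missing:
        raise KeyError(f"record is missing label fields: {missing}")
    return {name: float(record[name]) >= threshold for name in LABEL_FIELDS}


# The 'validation' example shown in the Data Instances section.
example = {
    "identity_attack": 0.0,
    "insult": 0.0,
    "obscene": 0.0,
    "severe_toxicity": 0.0,
    "sexual_explicit": 0.0,
    "text": "The public test.",
    "threat": 0.0,
    "toxicity": 0.0,
}

labels = binarize(example)
print(labels["toxicity"])  # False, because 0.0 < 0.5
```

Keeping the threshold as a parameter makes it easy to try a stricter or looser cut-off without touching the rest of a pipeline.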
### Data Splits

| name    |   train | validation |  test |
|---------|--------:|-----------:|------:|
| default | 1804874 |      97320 | 97320 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

This dataset is released under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/).
### Citation Information

```
@article{DBLP:journals/corr/abs-1903-04561,
  author        = {Daniel Borkan and Lucas Dixon and Jeffrey Sorensen and Nithum Thain and Lucy Vasserman},
  title         = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification},
  journal       = {CoRR},
  volume        = {abs/1903.04561},
  year          = {2019},
  url           = {http://arxiv.org/abs/1903.04561},
  archivePrefix = {arXiv},
  eprint        = {1903.04561},
  timestamp     = {Sun, 31 Mar 2019 19:01:24 +0200},
  biburl        = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```

### Contributions

Thanks to [@lewtun](https://github.com/lewtun), [@patrickvonplaten](https://github.com/patrickvonplaten), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
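
As a closing note on the Data Splits table earlier in this card, the row counts can be turned into a total and per-split shares with a few lines of arithmetic. This is a small sketch under the assumption that the table figures are exact; `SPLIT_SIZES` and `split_fractions` are illustrative names, not part of the dataset.

```python
from typing import Dict

# Row counts copied from the Data Splits table of this card.
SPLIT_SIZES: Dict[str, int] = {
    "train": 1_804_874,
    "validation": 97_320,
    "test": 97_320,
}


def split_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of rows."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total_rows = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

print(total_rows)  # 1999514
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```

If the data has been loaded as a mapping from split name to a sized collection, comparing `len(...)` of each split against `SPLIT_SIZES` is a cheap sanity check that the download is complete.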


Provider:

maas

Created:

2025-04-21
