galileo-ai/20_Newsgroups_Fixed
Hugging Face. Updated 2022-10-25; indexed 2025-07-05.
Download link:
https://hf-mirror.com/datasets/galileo-ai/20_Newsgroups_Fixed
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- en
license:
- unknown
multilinguality:
- monolingual
pretty_name: 20_Newsgroups_Fixed
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
- topic-classification
---
# Dataset Card for 20_Newsgroups_Fixed
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
## Dataset Description
- **Galileo Homepage:** [Galileo ML Data Intelligence Platform](https://www.rungalileo.io)
- **Repository:** [Needs More Information]
- **Dataset Blog:** [Improving Your ML Datasets With Galileo, Part 1](https://www.rungalileo.io/blog/)
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]
- **Sklearn Dataset:** [sklearn](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html#the-20-newsgroups-text-dataset)
- **20 Newsgroups Homepage:** [newsgroups homepage](http://qwone.com/~jason/20Newsgroups/)
### Dataset Summary
This dataset is a version of the [**20 Newsgroups**](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html#the-20-newsgroups-text-dataset) dataset fixed with the help of the [**Galileo ML Data Intelligence Platform**](https://www.rungalileo.io/). In a matter of minutes, Galileo enabled us to uncover and fix a multitude of errors within the original dataset. In the end, we present this improved dataset as a new standard for natural language experimentation and benchmarking using the Newsgroups dataset.
### Curation Rationale
This dataset was created to showcase the power of Galileo as a Data Intelligence Platform. Through Galileo, we identify critical error patterns within the original Newsgroups training dataset: garbage data that does not properly fit any newsgroup label category. Moreover, we observe that these errors also pervade the test dataset.
As a result of our analysis, we propose the addition of a new class to properly categorize and fix the labeling of garbage data samples: a "None" class. Galileo further enables us to quickly make these data sample changes within the training set (changing garbage data labels to None) and helps guide human re-annotation of the test set.
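The relabeling step described above can be sketched as follows. This is a minimal illustration, not Galileo's workflow: it assumes the garbage samples have already been identified by some other means and are passed in as a list of ids. The function name `relabel_flagged` and the toy records are hypothetical; only the field names `id`, `text`, and `label` come from this card.

```python
from typing import Iterable

NONE_LABEL = "None"  # catch-all class proposed in this card


def relabel_flagged(samples: list[dict], flagged_ids: Iterable[int]) -> list[dict]:
    """Return copies of `samples` where flagged ids are relabeled to NONE_LABEL."""
    flagged = set(flagged_ids)
    return [
        {**s, "label": NONE_LABEL} if s["id"] in flagged else dict(s)
        for s in samples
    ]


# Invented toy records; only the field names come from the card.
samples = [
    {"id": 1, "text": "How do I change the wallpaper in win 3.0?", "label": "comp.os.ms-windows.misc"},
    {"id": 2, "text": "asdf qwer zxcv", "label": "rec.autos"},
]
fixed = relabel_flagged(samples, flagged_ids=[2])
print([s["label"] for s in fixed])
```

The function returns new dictionaries rather than mutating its input, so the original labels remain available for comparison.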
#### Total Dataset Errors Fixed: 1163 *(6.5% of the dataset)*
|Errors / Split       |Overall|     Train|      Test|
|---------------------|------:|---------:|---------:|
|Garbage samples fixed| 718| 396| 322|
|Empty samples fixed | 445| 254| 254|
|Total samples fixed | 1163| 650| 650|
To learn more about the process of fixing this dataset, please refer to our [**Blog**](https://www.rungalileo.io/blog).
## Dataset Structure
### Data Instances
Each data sample contains the text of the newsgroup post, the newsgroup forum where the message was posted (the label), and a data sample id.
An example from the dataset looks as follows:
```
{'id': 1,
'text': 'I have win 3.0 and downloaded several icons and BMP\'s but I can\'t figure out\nhow to change the "wallpaper" or use the icons. Any help would be appreciated.\n\n\nThanx,\n\n-Brando',
'label': 'comp.os.ms-windows.misc'}
```
### Data Fields
- id: the unique numerical id associated with a data sample
- text: a string containing the text of the newsgroups message
- label: a string indicating the newsgroup forum where the sample was posted
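As a rough illustration of the three fields listed above, the sketch below checks that a record has an integer `id`, a string `text`, and a string `label`. The helper name `schema_problems` and the two toy records are hypothetical and are not part of the dataset or of any library.

```python
# Field names and types as described in the list above.
EXPECTED_FIELDS = {"id": int, "text": str, "label": str}


def schema_problems(sample: dict) -> list[str]:
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in sample:
            problems.append(f"missing field: {name}")
        elif not isinstance(sample[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    return problems


# Invented records for illustration only.
good = {"id": 1, "text": "Any help would be appreciated.", "label": "comp.os.ms-windows.misc"}
bad = {"id": "1", "text": "Any help would be appreciated."}

print(schema_problems(good))
print(schema_problems(bad))
```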
### Data Splits
The data is divided into a training split and a test split. To reduce bias and test generalizability across time, data samples are assigned to train or test depending upon whether their message was posted before or after a specific date, respectively.
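A date-based split of this kind can be sketched as below. This is an illustration under stated assumptions: the card does not give the cutoff date, and the released records carry no date field, so both the `posted` key and the cutoff value here are invented. Placing samples posted exactly on the cutoff in the test split is also an arbitrary choice made for this example.

```python
from datetime import date


def split_by_date(samples: list[dict], cutoff: date) -> tuple[list[dict], list[dict]]:
    """Put samples posted before `cutoff` in train, the rest in test."""
    train = [s for s in samples if s["posted"] < cutoff]
    test = [s for s in samples if s["posted"] >= cutoff]
    return train, test


# Invented records; `posted` is a hypothetical field, not part of the dataset.
samples = [
    {"id": 1, "posted": date(1993, 4, 2)},
    {"id": 2, "posted": date(1993, 4, 20)},
    {"id": 3, "posted": date(1993, 5, 1)},
]
train, test = split_by_date(samples, cutoff=date(1993, 4, 15))  # hypothetical cutoff
print([s["id"] for s in train], [s["id"] for s in test])
```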
### Data Classes
The fixed data is organized into 20 newsgroup topics plus a catch-all "None" class. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian). Here is a list of the 21 classes, partitioned according to subject matter:
| comp.graphics<br>comp.os.ms-windows.misc<br>comp.sys.ibm.pc.hardware<br>comp.sys.mac.hardware<br>comp.windows.x | rec.autos<br>rec.motorcycles<br>rec.sport.baseball<br>rec.sport.hockey | sci.crypt<br>sci.electronics<br>sci.med<br>sci.space |
|:---|:---:|---:|
| misc.forsale | talk.politics.misc<br>talk.politics.guns<br>talk.politics.mideast | talk.religion.misc<br>alt.atheism<br>soc.religion.christian |
| None |
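For use in code, the class table above can be written out as a mapping. This is a sketch only: the group names (`computers`, `recreation`, and so on) are labels invented for this example, and the integer ids simply follow the table order, which may not match any id encoding used by the published dataset.

```python
# Class names copied from the table above; group names are hypothetical.
CLASS_GROUPS = {
    "computers": [
        "comp.graphics",
        "comp.os.ms-windows.misc",
        "comp.sys.ibm.pc.hardware",
        "comp.sys.mac.hardware",
        "comp.windows.x",
    ],
    "recreation": ["rec.autos", "rec.motorcycles", "rec.sport.baseball", "rec.sport.hockey"],
    "science": ["sci.crypt", "sci.electronics", "sci.med", "sci.space"],
    "for_sale": ["misc.forsale"],
    "politics": ["talk.politics.misc", "talk.politics.guns", "talk.politics.mideast"],
    "religion": ["talk.religion.misc", "alt.atheism", "soc.religion.christian"],
    "none": ["None"],
}

# Flatten in table order and assign illustrative integer ids.
CLASSES = [name for group in CLASS_GROUPS.values() for name in group]
LABEL_TO_ID = {name: i for i, name in enumerate(CLASSES)}

print(len(CLASSES))
```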
Provided by:
galileo-ai



