galileo-ai/20_Newsgroups_Fixed

Name: galileo-ai/20_Newsgroups_Fixed
Creator: galileo-ai
Published: 2022-10-25 10:25:50
License: No description available

Source: Hugging Face. Updated 2022-10-25; indexed 2025-07-05.

Download link:

https://hf-mirror.com/datasets/galileo-ai/20_Newsgroups_Fixed

Description:

---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- en
license:
- unknown
multilinguality:
- monolingual
pretty_name: 20_Newsgroups_Fixed
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
- topic-classification
---

# Dataset Card for 20_Newsgroups_Fixed

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Galileo Homepage:** [Galileo ML Data Intelligence Platform](https://www.rungalileo.io)
- **Repository:** [Needs More Information]
- **Dataset Blog:** [Improving Your ML Datasets With Galileo, Part 1](https://www.rungalileo.io/blog/)
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]
- **Sklearn Dataset:** [sklearn](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html#the-20-newsgroups-text-dataset)
- **20 Newsgroups Homepage:** [newsgroups homepage](http://qwone.com/~jason/20Newsgroups/)

### Dataset Summary

This dataset is a version of the [**20 Newsgroups**](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html#the-20-newsgroups-text-dataset) dataset, fixed with the help of the [**Galileo ML Data Intelligence Platform**](https://www.rungalileo.io/). In a matter of minutes, Galileo enabled us to uncover and fix a multitude of errors within the original dataset. We present this improved dataset as a new standard for natural language experimentation and benchmarking using the Newsgroups dataset.

### Curation Rationale

This dataset was created to showcase the power of Galileo as a Data Intelligence Platform. Through Galileo, we identify critical error patterns within the original Newsgroups training dataset: garbage data that does not properly fit any newsgroup label category. Moreover, we observe that these errors also permeate the test dataset. As a result of our analysis, we propose adding a new class, "None", to properly categorize garbage data samples and fix their labels. Galileo further enables us to quickly make these changes within the training set (changing garbage data labels to None) and helps guide human re-annotation of the test set.

#### Total Dataset Errors Fixed: 1163 *(6.5% of the dataset)*

|Errors / Split       |Overall|     Train|      Test|
|---------------------|------:|---------:|---------:|
|Garbage samples fixed|    718|       396|       322|
|Empty samples fixed  |    445|       254|       254|
|Total samples fixed  |   1163|       650|       650|

To learn more about the process of fixing this dataset, please refer to our [**Blog**](https://www.rungalileo.io/blog).

## Dataset Structure

### Data Instances

Each data sample contains the text of the newsgroup post, the newsgroup forum where the message was posted (the label), and a data sample id.

An example from the dataset looks as follows:

```
{'id': 1,
 'text': 'I have win 3.0 and downloaded several icons and BMP\'s but I can\'t figure out\nhow to change the "wallpaper" or use the icons. Any help would be appreciated.\n\n\nThanx,\n\n-Brando',
 'label': 'comp.os.ms-windows.misc'}
```

### Data Fields

- id: the unique numerical id associated with a data sample
- text: a string containing the text of the newsgroups message
- label: a string indicating the newsgroup forum where the sample was posted

### Data Splits

The data is divided into a training split and a test split. To reduce bias and test generalizability across time, samples are assigned to train or test depending on whether their message was posted before or after a specific date, respectively.

### Data Classes

The fixed data is organized into 20 newsgroup topics plus a catch-all "None" class. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian).

Here is a list of the 21 classes, partitioned according to subject matter:

| comp.graphics<br>comp.os.ms-windows.misc<br>comp.sys.ibm.pc.hardware<br>comp.sys.mac.hardware<br>comp.windows.x | rec.autos<br>rec.motorcycles<br>rec.sport.baseball<br>rec.sport.hockey | sci.crypt<br>sci.electronics<br>sci.med<br>sci.space |
|:---|:---:|---:|
| misc.forsale | talk.politics.misc<br>talk.politics.guns<br>talk.politics.mideast | talk.religion.misc<br>alt.atheism<br>soc.religion.christian |
| None |
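
The record layout, the catch-all "None" class, and the date-based split described above can be illustrated with a minimal Python sketch. Everything beyond the `id`, `text`, and `label` fields is an assumption made for illustration: the sample records, the `posted` field, the cutoff date, and the helper names `relabel_empty` and `split_by_date` are hypothetical. The card does not state the actual cutoff date, and garbage samples were identified with Galileo and human re-annotation, so the sketch only handles the simplest case of empty text.

```python
from datetime import date

NONE_LABEL = "None"  # name of the catch-all class described in the card


def relabel_empty(samples):
    """Return copies of the samples, relabelling whitespace-only text as "None".

    Hypothetical helper: it covers only empty samples, not the garbage
    samples that the card says were found with Galileo.
    """
    fixed = []
    for sample in samples:
        label = sample["label"] if sample["text"].strip() else NONE_LABEL
        fixed.append({**sample, "label": label})
    return fixed


def split_by_date(samples, cutoff):
    """Put posts dated before `cutoff` in train and the rest in test.

    Assumption: a post dated exactly on the cutoff goes to test; the card
    only says "before or after a specific date".
    """
    train = [s for s in samples if s["posted"] < cutoff]
    test = [s for s in samples if s["posted"] >= cutoff]
    return train, test


# Made-up records; `posted` is not a field of the published dataset.
samples = [
    {"id": 1, "text": "How do I change the wallpaper in Windows 3.0?",
     "label": "comp.os.ms-windows.misc", "posted": date(1993, 3, 1)},
    {"id": 2, "text": "   \n",
     "label": "sci.space", "posted": date(1993, 3, 20)},
    {"id": 3, "text": "Selling a used bike helmet.",
     "label": "misc.forsale", "posted": date(1993, 5, 2)},
]

fixed = relabel_empty(samples)
train, test = split_by_date(fixed, cutoff=date(1993, 4, 1))  # hypothetical cutoff
print(len(train), len(test))  # prints: 2 1
```

Returning copies rather than editing records in place keeps the original labels available for comparison, which is useful when counting how many samples a fix touched.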


Provider:

galileo-ai
