
galileo-ai/20_Newsgroups_Fixed

Source: Hugging Face. Updated 2022-10-25. Indexed 2025-07-05.
Download link:
https://hf-mirror.com/datasets/galileo-ai/20_Newsgroups_Fixed
Description:
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- en
license:
- unknown
multilinguality:
- monolingual
pretty_name: 20_Newsgroups_Fixed
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
- topic-classification
---

# Dataset Card for 20_Newsgroups_Fixed

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Galileo Homepage:** [Galileo ML Data Intelligence Platform](https://www.rungalileo.io)
- **Repository:** [Needs More Information]
- **Dataset Blog:** [Improving Your ML Datasets With Galileo, Part 1](https://www.rungalileo.io/blog/)
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]
- **Sklearn Dataset:** [sklearn](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html#the-20-newsgroups-text-dataset)
- **20 Newsgroups Homepage:** [newsgroups homepage](http://qwone.com/~jason/20Newsgroups/)

### Dataset Summary

This dataset is a version of the [**20 Newsgroups**](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html#the-20-newsgroups-text-dataset) dataset fixed with the help of the [**Galileo ML Data Intelligence Platform**](https://www.rungalileo.io/). In a matter of minutes, Galileo enabled us to uncover and fix a multitude of errors within the original dataset. We present this improved dataset as a new standard for natural language experimentation and benchmarking using the Newsgroups dataset.

### Curation Rationale

This dataset was created to showcase the power of Galileo as a Data Intelligence Platform. Through Galileo, we identify critical error patterns within the original Newsgroups training dataset: garbage data that does not properly fit any newsgroup label category. Moreover, we observe that these errors also permeate the test dataset.

As a result of our analysis, we propose adding a new class, "None", to properly categorize garbage data samples and fix their labels. Galileo further enables us to quickly make these changes within the training set (changing garbage data labels to None) and helps guide human re-annotation of the test set.

#### Total Dataset Errors Fixed: 1163 *(6.5% of the dataset)*

| Errors / Split        | Overall | Train | Test |
|-----------------------|--------:|------:|-----:|
| Garbage samples fixed |     718 |   396 |  322 |
| Empty samples fixed   |     445 |   254 |  254 |
| Total samples fixed   |    1163 |   650 |  650 |

To learn more about the process of fixing this dataset, please refer to our [**Blog**](https://www.rungalileo.io/blog).

## Dataset Structure

### Data Instances

Each data sample contains the text of the newsgroup post, the newsgroup forum where the message was posted (the label), and a data sample id.

An example from the dataset looks as follows:

```
{'id': 1,
 'text': 'I have win 3.0 and downloaded several icons and BMP\'s but I can\'t figure out\nhow to change the "wallpaper" or use the icons. Any help would be appreciated.\n\n\nThanx,\n\n-Brando',
 'label': 'comp.os.ms-windows.misc'}
```

### Data Fields

- id: the unique numerical id associated with a data sample
- text: a string containing the text of the newsgroups message
- label: a string indicating the newsgroup forum where the sample was posted

### Data Splits

The data is divided into a training split and a test split. To reduce bias and test generalizability across time, samples are assigned to train or test depending on whether their message was posted before or after a specific date, respectively.

### Data Classes

The fixed data is organized into 20 newsgroup topics plus a catch-all "None" class. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian). Here is a list of the 21 classes, partitioned according to subject matter:

| comp.graphics<br>comp.os.ms-windows.misc<br>comp.sys.ibm.pc.hardware<br>comp.sys.mac.hardware<br>comp.windows.x | rec.autos<br>rec.motorcycles<br>rec.sport.baseball<br>rec.sport.hockey | sci.crypt<br>sci.electronics<br>sci.med<br>sci.space |
|:---|:---:|---:|
| misc.forsale | talk.politics.misc<br>talk.politics.guns<br>talk.politics.mideast | talk.religion.misc<br>alt.atheism<br>soc.religion.christian |
| None | | |
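
The record layout and the "None" class described in this card can be illustrated with a small, standard-library-only sketch. It is not the procedure used to build the dataset (that relied on Galileo and human re-annotation). The helper names `validate` and `relabel_empty`, the whitespace-only rule for "empty" samples, and the two sample records are all assumptions made for illustration.

```python
from collections import Counter

# The 20 newsgroup labels listed in the card, plus the catch-all "None" class.
CLASSES = [
    "comp.graphics", "comp.os.ms-windows.misc", "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware", "comp.windows.x",
    "rec.autos", "rec.motorcycles", "rec.sport.baseball", "rec.sport.hockey",
    "sci.crypt", "sci.electronics", "sci.med", "sci.space",
    "misc.forsale",
    "talk.politics.misc", "talk.politics.guns", "talk.politics.mideast",
    "talk.religion.misc", "alt.atheism", "soc.religion.christian",
    "None",
]


def validate(record):
    """Return True if the record has the three documented fields.

    Assumes `id` is an int and `text`/`label` are strings, as the
    Data Fields section suggests.
    """
    if not isinstance(record.get("id"), int):
        return False
    if not isinstance(record.get("text"), str):
        return False
    return record.get("label") in CLASSES


def relabel_empty(record):
    """Hypothetical rule: move whitespace-only posts to the "None" class."""
    if record["text"].strip() == "":
        return {**record, "label": "None"}
    return record


# Illustrative records only; they are not taken from the dataset.
samples = [
    {"id": 1, "text": "How do I change the wallpaper in win 3.0?",
     "label": "comp.os.ms-windows.misc"},
    {"id": 2, "text": "   \n", "label": "sci.space"},
]

fixed = [relabel_empty(r) for r in samples]
label_counts = Counter(r["label"] for r in fixed)
print(label_counts)
```

The same helpers could be applied to the real records after loading them, for example through the Hugging Face `datasets` library with the repository id `galileo-ai/20_Newsgroups_Fixed`. The exact split names exposed there are not stated in this card.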

Provider:
galileo-ai