bergr7/weakly_supervised_ag_news
Source: Hugging Face. Updated: 2022-10-06. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/bergr7/weakly_supervised_ag_news
Resource description:
---
annotations_creators: []
language:
- en
language_creators:
- other
license: []
multilinguality:
- monolingual
pretty_name: Weakly supervised AG News Dataset
size_categories:
- 1K<n<10K
source_datasets:
- extended|ag_news
tags: []
task_categories:
- text-classification
task_ids:
- multi-class-classification
---
# Dataset Card for Weakly supervised AG News Dataset
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
### Dataset Summary
AG is a collection of more than 1 million news articles. The articles were gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine that has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .
The Weakly supervised AG News Dataset was created by Team 44 of the FSDL 2022 course for the sole purpose of experimenting with weak supervision techniques. It was assumed that only the labels of the original test set and of 20% of the training set were available. The labels in the weakly supervised training set were obtained by creating weak labels with labeling functions (LFs) and denoising them with Snorkel's label model.
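To make the idea of labeling functions concrete, here is a minimal, dependency-free sketch. It is not the team's actual pipeline: the keyword lists are hypothetical, and the votes are combined by simple vote fractions rather than by Snorkel's label model, which additionally estimates how accurate each labeling function is.

```python
from collections import Counter

ABSTAIN = -1
WORLD, SPORTS, BUSINESS, SCI_TECH = 0, 1, 2, 3

# Hypothetical keyword lists, chosen only for illustration.
KEYWORDS = {
    WORLD: ("minister", "president", "election"),
    SPORTS: ("coach", "league", "championship"),
    BUSINESS: ("shares", "profit", "stocks"),
    SCI_TECH: ("software", "internet", "nasa"),
}


def make_keyword_lf(label, keywords):
    """Build a labeling function that votes `label` if any keyword occurs."""
    def lf(text):
        lowered = text.lower()
        return label if any(k in lowered for k in keywords) else ABSTAIN
    return lf


# One labeling function per class.
LFS = [make_keyword_lf(label, kws) for label, kws in KEYWORDS.items()]


def weak_label(text):
    """Return per-class vote fractions, or None if every LF abstains."""
    votes = [v for v in (lf(text) for lf in LFS) if v != ABSTAIN]
    if not votes:
        return None
    counts = Counter(votes)
    return [counts.get(c, 0) / len(votes) for c in range(4)]


print(weak_label("The coach praised the league winners"))  # [0.0, 1.0, 0.0, 0.0]
print(weak_label("Software company shares rise"))          # [0.0, 0.0, 0.5, 0.5]
print(weak_label("Nothing matches here"))                  # None
```

In this sketch, a text on which every labeling function abstains receives no label at all, and conflicting votes produce a soft (probabilistic) label rather than a single class.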
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
English
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
- `text`: a string feature.
- `label`: a classification label, with possible values including World (0), Sports (1), Business (2), Sci/Tech (3).
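The mapping between label ids and class names can be sketched as follows. The card does not say how probabilistic labels are stored, so the assumption here is that one is a list of four class probabilities in id order; `hard_label` is a hypothetical helper name.

```python
# Class names in label-id order, as listed in the card.
LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def hard_label(probs):
    """Return (id, name) of the most probable class.

    Assumes `probs` holds one probability per class, in id order.
    """
    if len(probs) != len(LABEL_NAMES):
        raise ValueError(f"expected {len(LABEL_NAMES)} probabilities, got {len(probs)}")
    idx = max(range(len(probs)), key=probs.__getitem__)
    return idx, LABEL_NAMES[idx]


print(hard_label([0.1, 0.2, 0.6, 0.1]))  # (2, 'Business')
```

Taking the argmax discards the uncertainty carried by a probabilistic label; training directly on the probabilities keeps it.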
### Data Splits
- Training set with probabilistic labels from weak supervision: 37340
- Unlabeled data: 58660
- Validation set: 24000
- Test set: 7600
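As a quick consistency check on these numbers, the sketch below adds up the splits and computes the share of the non-test data held out for validation. The dictionary keys are illustrative names, not the split names used in the repository.

```python
# Split sizes as stated in the card; the keys are hypothetical names.
SPLITS = {
    "train_weak": 37340,
    "unlabeled": 58660,
    "validation": 24000,
    "test": 7600,
}

total = sum(SPLITS.values())
non_test = total - SPLITS["test"]
validation_share = SPLITS["validation"] / non_test
weak_share = SPLITS["train_weak"] / (SPLITS["train_weak"] + SPLITS["unlabeled"])

print(total)                       # 127600
print(non_test)                    # 120000
print(f"{validation_share:.0%}")   # 20%
print(f"{weak_share:.1%}")         # 38.9%
```

The validation set is 20% of the 120000 non-test examples, which matches the stated assumption that labels were available for 20% of the training set. Of the remaining 96000 examples, about 38.9% received a weak label.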
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
[More Information Needed]
### Contributions
Thanks to Xiang Zhang (xiang.zhang@nyu.edu) for adding this dataset to the HF Dataset Hub.
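Provider:
bergr7
Summary of original information
Dataset overview
Basic information
- Name: Weakly supervised AG News Dataset
- Language: English
- Multilinguality: monolingual
- Size: 1K<n<10K
- Source: extended from the AG News dataset
- Task category: text-classification
- Task ID: multi-class-classification
Dataset description
Dataset summary
- Source: AG News is a collection of more than 1 million news articles gathered by ComeToMyHead.
- Purpose: created by Team 44 of the FSDL 2022 course to experiment with weak supervision techniques.
- Label acquisition: weak labels were created with LFs and denoised with Snorkel's label model.
Supported tasks and leaderboards
- More information needed
Languages
- Language: English
Dataset structure
Data instances
- More information needed
Data fields
- text: a string feature
- label: a classification label, with possible values including World (0), Sports (1), Business (2), Sci/Tech (3).
Data splits
- Training set: 37340 (with probabilistic labels from weak supervision)
- Unlabeled data: 58660
- Validation set: 24000
- Test set: 7600
Dataset creation
Data collection and normalization
- More information needed
Source language producers
- More information needed
Annotations
- Annotation process: more information needed
- Annotators: more information needed
Personal and sensitive information
- More information needed
Considerations for using the data
Social impact of dataset
- More information needed
Discussion of biases
- More information needed
Other known limitations
- More information needed
Additional information
Dataset curators
- More information needed
Licensing information
- More information needed
Citation information
- More information needed
Contributions
- Contributor: Xiang Zhang (xiang.zhang@nyu.edu)
- Contribution: added this dataset to the HF Dataset Hub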



