sh0416/ag_news
Source: Hugging Face. Updated 2023-02-23; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/sh0416/ag_news
Description:
---
task_categories:
- text-classification
language:
- en
---
AG's News Topic Classification Dataset
Version 3, Updated 09/09/2015
ORIGIN
AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .
The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
DESCRIPTION
The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and the total number of testing samples is 7,600.
The file classes.txt contains a list of classes corresponding to each label.
The files train.csv and test.csv contain the training and testing samples, respectively, as comma-separated values. Each file has 3 columns: class index (1 to 4), title, and description. The title and description are enclosed in double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
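The CSV conventions described above can be handled with Python's standard csv module, which already understands doubled double quotes. The following is a minimal sketch, not the dataset author's own loader: the function name read_ag_news_csv and the sample row are invented for illustration, and the only extra step is turning the two-character sequence backslash-n back into a real newline.

```python
import csv
import io

# Hypothetical one-row sample that follows the described format.
# The row content is invented; it is not taken from the dataset.
SAMPLE = '3,"Oil ""spike"" worries","Prices rose.\\nAnalysts are cautious."\n'


def read_ag_news_csv(handle):
    """Yield (class_index, title, description) from an open CSV file object."""
    for row in csv.reader(handle):
        label, title, description = row
        # The csv module resolves "" to "; the literal backslash-n
        # sequence still has to be converted to a newline by hand.
        yield (
            int(label),
            title.replace("\\n", "\n"),
            description.replace("\\n", "\n"),
        )


rows = list(read_ag_news_csv(io.StringIO(SAMPLE)))
```

For the real files, the same function could be called on a handle opened with open("train.csv", newline="", encoding="utf-8"); the encoding is an assumption, since the description above does not state it.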
CLASS NAME INFORMATION
1: World
2: Sports
3: Business
4: Sci/Tech
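As a small illustration of the table above, the class indices can be kept in a lookup table. This is only a sketch: the names CLASS_NAMES, class_name, and to_zero_based are hypothetical, and the zero-based conversion is included on the assumption that the downstream training code expects labels starting at 0, which the dataset itself does not require.

```python
# Class indices exactly as listed above (1-based).
CLASS_NAMES = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}


def class_name(index):
    """Return the class name for a 1-based class index."""
    return CLASS_NAMES[index]


def to_zero_based(index):
    """Convert a 1-based class index to a 0-based label (assumed convention)."""
    if index not in CLASS_NAMES:
        raise ValueError(f"unknown class index: {index}")
    return index - 1
```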
JSONL FORMAT
Instead of preserving the CSV format, I changed the format to JSONL, which avoids the complicated rules about double quotes and escaping.
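To show why JSONL sidesteps the quoting rules, here is a rough sketch of writing and reading one record per line with Python's json module. The field names "label", "title", and "description" are assumptions made for this example; the text above does not state which keys the JSONL files actually use, so check a line of the real file before relying on them.

```python
import json


def to_jsonl_line(label, title, description):
    """Serialize one sample as a single JSON line (field names are assumed)."""
    record = {"label": label, "title": title, "description": description}
    # json.dumps escapes quotes and newlines itself, so the output
    # always stays on one physical line.
    return json.dumps(record, ensure_ascii=False)


def from_jsonl_line(line):
    """Parse one JSON line back into a dict."""
    return json.loads(line)


line = to_jsonl_line(3, 'Oil "spike" worries', "Prices rose.\nAnalysts are cautious.")
record = from_jsonl_line(line)
```

A whole file can then be read by iterating over its lines and calling from_jsonl_line on each non-empty one.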
Provider:
sh0416
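Summary of original information
AG's News Topic Classification Dataset
Basic information
- Task category: text classification
- Language: English
Dataset overview
- Version: 3
- Updated: 09/09/2015
- Source: collected by ComeToMyHead; more than 1 million news articles from more than 2000 news sources.
- Intended use: non-commercial research activity such as data mining, information retrieval, XML, data compression, and data streaming.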
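Dataset construction
- Constructed by: Xiang Zhang (xiang.zhang@nyu.edu)
- Reference: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
- Class details: the 4 largest classes were chosen from the original corpus; each class contains 30,000 training samples and 1,900 testing samples.
- Total samples: 120,000 training samples and 7,600 testing samples.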
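Data files
- classes.txt: contains the list of classes corresponding to each label.
- train.csv: contains all the training samples as comma-separated values, with 3 columns: class index (1 to 4), title, and description.
- test.csv: contains all the testing samples, in the same format as train.csv.
Class names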
- 1: World
- 2: Sports
- 3: Business
- 4: Sci/Tech
Data format
- Original format: CSV
- Converted format: JSONL, which simplifies the handling of double quotes and escaping rules.



