
PocketDoc/Retro-YahooAnswers

Hugging Face | Updated: 2023-12-07 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/PocketDoc/Retro-YahooAnswers
Resource description:
---
task_categories:
- question-answering
language:
- en
tags:
- not-for-all-audiences
- alpaca
pretty_name: Retro Yahoo! Answers
size_categories:
- 1M<n<10M
---

### Description

This dataset is an instruct-style dataset built from a scrape of the Yahoo! Answers website made in 2007. It comprises 10 categories labeled 1-10:

1. Society & Culture
2. Science & Mathematics
3. Health
4. Education & Reference
5. Computers & Internet
6. Sports
7. Business & Finance
8. Entertainment & Music
9. Family & Relationships
10. Politics & Government

The subject line and body of each question have been combined into a single field, separated by a newline character.

I would caution against using this dataset for any serious application, as it contains hilariously out-of-date information, offensive language, and frequent spelling and grammar errors. It is, however, a charming snapshot of the internet in 2007.

**Roughly 228m llama tokens in 1.4m samples**

### Original README

>Yahoo! Answers Topic Classification Dataset
>
>Version 2, Updated 09/09/2015
>
>
>ORIGIN
>
>The original Yahoo! Answers corpus can be obtained through the Yahoo! Research Alliance Webscope program. The dataset is to be used for approved non-commercial research purposes by recipients who have signed a Data Sharing Agreement with Yahoo!. The dataset is the Yahoo! Answers corpus as of 10/25/2007. It includes all the questions and their corresponding answers. The corpus contains 4483032 questions and their answers.
>
>The Yahoo! Answers topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the above dataset. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
>
>
>DESCRIPTION
>
>The Yahoo! Answers topic classification dataset is constructed using the 10 largest main categories. Each class contains 140,000 training samples and 6,000 testing samples. Therefore, the total number of training samples is 1,400,000 and testing samples 60,000 in this dataset. From all the answers and other meta-information, we only used the best answer content and the main category information.
>
>The file classes.txt contains a list of classes corresponding to each label.
>
>The files train.csv and test.csv contain all the training samples as comma-separated values. There are 4 columns in them, corresponding to class index (1 to 10), question title, question content and best answer. The text fields are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed with an "n" character, that is "\n".
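
The CSV layout described in the original README can be read with Python's standard `csv` module. The following is a minimal sketch, not code from the dataset authors: the helper name `parse_rows` and the sample row are invented for illustration, and it assumes the literal two-character sequence `\n` is the only escape left to undo once the CSV quoting has been parsed.

```python
import csv
import io


def parse_rows(csv_text):
    """Yield (class_index, title, content, best_answer) tuples.

    Hypothetical helper: assumes the 4-column layout from the README.
    """
    # csv.reader handles double-quoted fields and turns "" into " by default.
    reader = csv.reader(io.StringIO(csv_text))
    for class_index, title, content, answer in reader:
        # New lines are stored as a backslash followed by "n"; restore them.
        fields = [s.replace("\\n", "\n") for s in (title, content, answer)]
        yield (int(class_index), *fields)


# Invented sample row, not taken from the dataset.
sample = (
    '"5","Why is my PC slow?",'
    '"It takes ""forever"" to boot.\\nAny ideas?",'
    '"Try removing startup programs."\n'
)
rows = list(parse_rows(sample))
print(rows[0])
```

For the real files, the same reader could be pointed at a file opened with `newline=""`, as the `csv` documentation recommends.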
Provider:
PocketDoc
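Summary of original information

Dataset overview

Basic information

  • Task category: question answering
  • Language: English
  • Tags: not-for-all-audiences, alpaca
  • Name: Retro Yahoo! Answers
  • Size category: 1M<n<10M

Description

This is an instruct-style dataset containing data scraped from the Yahoo! Answers website in 2007, divided into 10 categories numbered 1-10. The categories are:

  1. Society & Culture
  2. Science & Mathematics
  3. Health
  4. Education & Reference
  5. Computers & Internet
  6. Sports
  7. Business & Finance
  8. Entertainment & Music
  9. Family & Relationships
  10. Politics & Government

The question title and content are merged into a single field, separated by a newline character. The dataset contains outdated information, offensive language, and frequent spelling and grammar errors, and is not recommended for serious applications.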
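
The merging of title and content into one field can be sketched as below. This is an illustration under stated assumptions, not the dataset author's conversion script: the function name `to_instruct_record` and the field names `instruction`, `output`, and `category` are hypothetical (loosely following the alpaca convention suggested by the dataset tag), and dropping the newline when the body is empty is a guess.

```python
# Category names as listed in the dataset description, keyed by class index.
CATEGORIES = {
    1: "Society & Culture",
    2: "Science & Mathematics",
    3: "Health",
    4: "Education & Reference",
    5: "Computers & Internet",
    6: "Sports",
    7: "Business & Finance",
    8: "Entertainment & Music",
    9: "Family & Relationships",
    10: "Politics & Government",
}


def to_instruct_record(class_index, title, content, best_answer):
    """Build one instruct-style record (field names are assumptions)."""
    # Join subject line and body with a newline; assumed: no newline if body is empty.
    question = f"{title}\n{content}" if content else title
    return {
        "instruction": question,
        "output": best_answer,
        "category": CATEGORIES[class_index],
    }


# Invented example values, not taken from the dataset.
record = to_instruct_record(6, "Who won?", "The final last night.", "The home team.")
print(record["instruction"])
```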

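Details

  • Size: roughly 228m llama tokens in 1.4m samples
  • Original dataset: Yahoo! Answers Topic Classification Dataset
  • Version: 2
  • Updated: September 9, 2015
  • Source: Yahoo! Research Alliance Webscope program
  • Use: approved non-commercial research purposes only
  • Corpus snapshot date: October 25, 2007
  • Original corpus size: 4,483,032 questions and their answers
  • Classes: the 10 largest main categories are used; each class contains 140,000 training samples and 6,000 testing samples, for a total of 1,400,000 training samples and 60,000 testing samples
  • Files:
    • classes.txt: contains the list of classes corresponding to each label
    • train.csv and test.csv: contain all training and testing samples as comma-separated values with 4 columns: class index (1 to 10), question title, question content, and best answer. Text fields are enclosed in double quotes, internal double quotes are escaped as two double quotes, and new lines are escaped as a backslash followed by an "n" character.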
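
The figures above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch: it assumes the "1.4m samples" figure corresponds to the 1,400,000 training samples, which the source does not state explicitly, and the token count is the rounded value quoted above.

```python
NUM_CLASSES = 10
TRAIN_PER_CLASS = 140_000
TEST_PER_CLASS = 6_000
ORIGINAL_CORPUS = 4_483_032      # questions in the Webscope corpus
APPROX_TOKENS = 228_000_000      # "roughly 228m llama tokens" (rounded)

train_total = NUM_CLASSES * TRAIN_PER_CLASS   # 1,400,000
test_total = NUM_CLASSES * TEST_PER_CLASS     # 60,000

# Assumption: the 1.4m samples are the training split.
avg_tokens = APPROX_TOKENS / train_total

# Share of the original corpus kept by the classification dataset.
share = (train_total + test_total) / ORIGINAL_CORPUS

print(train_total, test_total)
print(f"about {avg_tokens:.0f} tokens per sample")
print(f"about {share:.1%} of the original corpus")
```

Under these assumptions the average works out to roughly 163 llama tokens per sample, and the classification dataset keeps about a third of the original corpus.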