kyunghyuncho/search_qa

Source: Hugging Face. Updated 2023-06-16; indexed 2024-05-25.
Download link:
https://hf-mirror.com/datasets/kyunghyuncho/search_qa
Resource description:
---
annotations_creators:
- found
language:
- en
language_creators:
- found
license:
- unknown
multilinguality:
- monolingual
pretty_name: SearchQA
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- extractive-qa
paperswithcode_id: searchqa
dataset_info:
- config_name: raw_jeopardy
  features:
  - name: category
    dtype: string
  - name: air_date
    dtype: string
  - name: question
    dtype: string
  - name: value
    dtype: string
  - name: answer
    dtype: string
  - name: round
    dtype: string
  - name: show_number
    dtype: int32
  - name: search_results
    sequence:
    - name: urls
      dtype: string
    - name: snippets
      dtype: string
    - name: titles
      dtype: string
    - name: related_links
      dtype: string
  splits:
  - name: train
    num_bytes: 7770972348
    num_examples: 216757
  download_size: 3314386157
  dataset_size: 7770972348
- config_name: train_test_val
  features:
  - name: category
    dtype: string
  - name: air_date
    dtype: string
  - name: question
    dtype: string
  - name: value
    dtype: string
  - name: answer
    dtype: string
  - name: round
    dtype: string
  - name: show_number
    dtype: int32
  - name: search_results
    sequence:
    - name: urls
      dtype: string
    - name: snippets
      dtype: string
    - name: titles
      dtype: string
    - name: related_links
      dtype: string
  splits:
  - name: train
    num_bytes: 5303005740
    num_examples: 151295
  - name: test
    num_bytes: 1466749978
    num_examples: 43228
  - name: validation
    num_bytes: 740962715
    num_examples: 21613
  download_size: 3148550732
  dataset_size: 7510718433
---

# Dataset Card for "search_qa"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Repository:** https://github.com/nyu-dl/dl4ir-searchQA
- **Paper:** [SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine](https://arxiv.org/abs/1704.05179)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 6.46 GB
- **Size of the generated dataset:** 15.28 GB
- **Total amount of disk used:** 21.74 GB

### Dataset Summary

We publicly release a new large-scale dataset, called SearchQA, for machine comprehension, or question-answering. Unlike recently released datasets, such as DeepMind CNN/DailyMail and SQuAD, the proposed SearchQA was constructed to reflect a full pipeline of general question-answering. That is, we start not from an existing article and generate a question-answer pair, but start from an existing question-answer pair, crawled from J! Archive, and augment it with text snippets retrieved by Google. Following this approach, we built SearchQA, which consists of more than 140k question-answer pairs with each pair having 49.6 snippets on average. Each question-answer-context tuple of the SearchQA comes with additional meta-data such as the snippet's URL, which we believe will be valuable resources for future research. We conduct human evaluation as well as test two baseline methods, one simple word selection and the other deep learning based, on the SearchQA. We show that there is a meaningful gap between the human and machine performances. This suggests that the proposed dataset could well serve as a benchmark for question-answering.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### raw_jeopardy

- **Size of downloaded dataset files:** 3.31 GB
- **Size of the generated dataset:** 7.77 GB
- **Total amount of disk used:** 11.09 GB

An example of 'train' looks as follows.

```

```

#### train_test_val

- **Size of downloaded dataset files:** 3.15 GB
- **Size of the generated dataset:** 7.51 GB
- **Total amount of disk used:** 10.66 GB

An example of 'validation' looks as follows.

```

```

### Data Fields

The data fields are the same among all splits.

#### raw_jeopardy

- `category`: a `string` feature.
- `air_date`: a `string` feature.
- `question`: a `string` feature.
- `value`: a `string` feature.
- `answer`: a `string` feature.
- `round`: a `string` feature.
- `show_number`: an `int32` feature.
- `search_results`: a dictionary feature containing:
  - `urls`: a `string` feature.
  - `snippets`: a `string` feature.
  - `titles`: a `string` feature.
  - `related_links`: a `string` feature.

#### train_test_val

- `category`: a `string` feature.
- `air_date`: a `string` feature.
- `question`: a `string` feature.
- `value`: a `string` feature.
- `answer`: a `string` feature.
- `round`: a `string` feature.
- `show_number`: an `int32` feature.
- `search_results`: a dictionary feature containing:
  - `urls`: a `string` feature.
  - `snippets`: a `string` feature.
  - `titles`: a `string` feature.
  - `related_links`: a `string` feature.

### Data Splits

#### raw_jeopardy

|            | train|
|------------|-----:|
|raw_jeopardy|216757|

#### train_test_val

|              | train|validation| test|
|--------------|-----:|---------:|----:|
|train_test_val|151295|     21613|43228|

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@article{DBLP:journals/corr/DunnSHGCC17,
  author        = {Matthew Dunn and
                   Levent Sagun and
                   Mike Higgins and
                   V. Ugur G{\"{u}}ney and
                   Volkan Cirik and
                   Kyunghyun Cho},
  title         = {SearchQA: {A} New Q{\&}A Dataset Augmented with Context from a Search Engine},
  journal       = {CoRR},
  volume        = {abs/1704.05179},
  year          = {2017},
  url           = {http://arxiv.org/abs/1704.05179},
  archivePrefix = {arXiv},
  eprint        = {1704.05179},
  timestamp     = {Mon, 13 Aug 2018 16:47:09 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/DunnSHGCC17.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```

### Contributions

Thanks to [@lewtun](https://github.com/lewtun), [@mariamabarham](https://github.com/mariamabarham), [@lhoestq](https://github.com/lhoestq), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
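
The `search_results` field listed under Data Fields groups four string features inside a sequence. In the Hugging Face `datasets` library, a sequence of named sub-features is typically exposed as a dictionary of equal-length lists, so a small helper can regroup it into one entry per search hit. The following is a minimal sketch under that assumption. `iter_search_results` and the toy record are hypothetical and are not part of the dataset or its repository.

```python
from typing import Any, Dict, Iterator

# Sub-feature names as listed in the Data Fields section of the card.
SEARCH_RESULT_KEYS = ("urls", "snippets", "titles", "related_links")


def iter_search_results(example: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per search hit from a column-oriented `search_results` field.

    Assumes `example["search_results"]` maps each key in SEARCH_RESULT_KEYS
    to a list, and that all four lists have the same length.
    """
    results = example["search_results"]
    columns = [results[key] for key in SEARCH_RESULT_KEYS]
    for row in zip(*columns):
        yield dict(zip(SEARCH_RESULT_KEYS, row))


# Hypothetical record with made-up values, shaped like the documented schema.
toy_example = {
    "category": "EXAMPLE CATEGORY",
    "air_date": "2000-01-01",
    "question": "An example clue",
    "value": "$200",
    "answer": "an example answer",
    "round": "Jeopardy!",
    "show_number": 1,
    "search_results": {
        "urls": ["https://example.org/a", "https://example.org/b"],
        "snippets": ["first snippet", "second snippet"],
        "titles": ["First title", "Second title"],
        "related_links": [None, None],
    },
}

hits = list(iter_search_results(toy_example))
for hit in hits:
    print(hit["titles"], "->", hit["snippets"])
```

Real records could presumably be obtained with `datasets.load_dataset("kyunghyuncho/search_qa", "train_test_val", split="validation")`. Whether that call works unchanged depends on the installed `datasets` version, so treat it as an assumption rather than a tested recipe.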
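Provider:
kyunghyuncho
Summary of original information

Dataset overview

Basic information

  • Name: SearchQA
  • Language: English (en)
  • Multilinguality: monolingual
  • License: unknown
  • Size category: 100K<n<1M
  • Source datasets: original
  • Task category: question-answering
  • Task ID: extractive-qa
  • Papers with Code ID: searchqa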

Dataset structure

  • Config names: raw_jeopardy and train_test_val

  • Features:

    • category: string
    • air_date: string
    • question: string
    • value: string
    • answer: string
    • round: string
    • show_number: integer (int32)
    • search_results: dictionary containing urls, snippets, titles, and related_links, all of type string
  • Data splits:

    • raw_jeopardy:
      • train: 216757 examples, 7770972348 bytes
    • train_test_val:
      • train: 151295 examples, 5303005740 bytes
      • test: 43228 examples, 1466749978 bytes
      • validation: 21613 examples, 740962715 bytes
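
The example counts above can be turned into split proportions with a few lines of arithmetic. The sketch below uses only the counts from the listing. The variable names are illustrative, and the source does not explain why `train_test_val` contains fewer examples than `raw_jeopardy`, so the difference is reported as a plain number without interpretation.

```python
# Example counts copied from the split listing above.
split_sizes = {"train": 151295, "validation": 21613, "test": 43228}
raw_jeopardy_total = 216757

# Total number of examples across the three train_test_val splits.
split_total = sum(split_sizes.values())

# Fraction of train_test_val that falls in each split.
fractions = {name: count / split_total for name, count in split_sizes.items()}

# Examples present in raw_jeopardy but not counted in train_test_val.
difference = raw_jeopardy_total - split_total

for name, fraction in fractions.items():
    print(f"{name}: {split_sizes[name]} examples ({fraction:.1%})")
print(f"train_test_val total: {split_total}")
print(f"raw_jeopardy minus train_test_val: {difference}")
```

With these counts the three splits sum to 216136 examples, 621 fewer than `raw_jeopardy`, and the proportions come out very close to 70% train, 10% validation, and 20% test.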

Dataset size

  • Download size:
    • raw_jeopardy: 3314386157 bytes
    • train_test_val: 3148550732 bytes
  • Generated dataset size:
    • raw_jeopardy: 7770972348 bytes
    • train_test_val: 7510718433 bytes
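
The byte counts above line up with the rounded gigabyte figures quoted in the dataset card if 1 GB is taken as 10^9 bytes. The following sketch checks that reading. The helper names `to_gb` and `disk_usage_gb` are hypothetical, and the decimal-gigabyte convention is an inference from the numbers, not something the card states.

```python
# Byte counts copied from the listing above.
SIZES = {
    "raw_jeopardy": {"download": 3314386157, "generated": 7770972348},
    "train_test_val": {"download": 3148550732, "generated": 7510718433},
}

# Per-split byte counts for train_test_val (train, test, validation).
TRAIN_TEST_VAL_SPLIT_BYTES = [5303005740, 1466749978, 740962715]


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (10**9 bytes), rounded to 2 places."""
    return round(num_bytes / 1e9, 2)


def disk_usage_gb(config: str) -> float:
    """Download size plus generated size for one config, in decimal GB."""
    entry = SIZES[config]
    return to_gb(entry["download"] + entry["generated"])


for config, entry in SIZES.items():
    print(
        f"{config}: download {to_gb(entry['download'])} GB, "
        f"generated {to_gb(entry['generated'])} GB, "
        f"total {disk_usage_gb(config)} GB"
    )
```

Under this convention the per-config totals come to 11.09 GB and 10.66 GB, matching the "Total amount of disk used" values in the card, and the three `train_test_val` split sizes sum exactly to the reported generated size of 7510718433 bytes.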