
LLukas22/NLQuAD

Source: Hugging Face · Updated: 2022-12-23 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/LLukas22/NLQuAD
Resource description:
---
pretty_name: NLQuAD
language:
- en
license:
- cc-by-3.0
size_categories:
- 10K<n<100K
multilinguality:
- monolingual
task_ids:
- extractive-qa
dataset_info:
  features:
  - name: title
    dtype: string
  - name: date
    dtype: string
  - name: paragraphs
    list:
    - name: context
      dtype: string
    - name: qas
      list:
      - name: answers
        list:
        - name: answer_end
          dtype: int64
        - name: answer_start
          dtype: int64
        - name: text
          dtype: string
      - name: id
        dtype: string
      - name: question
        dtype: string
  splits:
  - name: train
    num_bytes: 72036724
    num_examples: 10259
  - name: test
    num_bytes: 9045482
    num_examples: 1280
  - name: validation
    num_bytes: 8876137
    num_examples: 1280
  download_size: 0
  dataset_size: 89958343
---

# Dataset Card for "NLQuAD"

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** [https://github.com/ASoleimaniB/NLQuAD](https://github.com/ASoleimaniB/NLQuAD)
- **Paper:** [https://aclanthology.org/2021.eacl-main.106/](https://aclanthology.org/2021.eacl-main.106/)
- **Size of the generated dataset:** 89.95 MB

### Dataset Summary

This is a copy of the original NLQuAD dataset distributed via [GitHub](https://github.com/ASoleimaniB/NLQuAD).

NLQuAD is a non-factoid long question answering dataset built from BBC news articles. Its question types, together with the length of its context documents and answers, make it a challenging real-world task. NLQuAD consists of news articles as context documents, interrogative sub-headings in the articles as questions, and the body paragraphs corresponding to those sub-headings as contiguous answers to the questions.
NLQuAD contains 31k non-factoid questions and long answers collected from 13k BBC news articles. See example BBC articles [1](https://www.bbc.com/news/world-asia-china-51230011), [2](https://www.bbc.com/news/world-55709428). We extract target answers automatically because annotating non-factoid long QA is extremely challenging and costly.

## Dataset Structure

### Data Instances

An example of 'train' looks as follows.

```json
{
  "title": "Khashoggi murder: Body 'dissolved in acid'",
  "date": "2 November 2018",
  "paragraphs": [
    {
      "context": "A top Turkish official, presidential adviser Yasin Aktay, has said ....",
      "qas": [
        {
          "question": "What was said in the crown prince's alleged phone call?",
          "id": "0_0",
          "answers": [
            {
              "text": "During the call with President Donald Trump's son-in-law Jared Kushner and national ....",
              "answer_start": 1352,
              "answer_end": 2108
            }
          ]
        },
        {
          "question": "What has the investigation found so far?",
          "id": "0_1",
          "answers": [
            {
              "text": "There is still no consensus on how Khashoggi died. He entered ....",
              "answer_start": 2109,
              "answer_end": 3128
            }
          ]
        }
      ]
    }
  ]
}
```

### Data Fields

The data fields are the same among all splits.

- `title`: a `string` feature.
- `date`: a `string` feature.
- `paragraphs`: a list feature containing dictionaries:
  - `context`: a `string` feature.
  - `qas`: a list feature containing dictionaries:
    - `question`: a `string` feature.
    - `id`: a `string` feature.
    - `answers`: a list feature containing dictionaries:
      - `text`: a `string` feature.
      - `answer_start`: an `int64` feature.
      - `answer_end`: an `int64` feature.

### Data Splits

| name | train | test | validation |
|------|------:|-----:|-----------:|
|      | 10259 | 1280 |       1280 |

## Additional Information

### Licensing Information

This dataset is distributed under the [CC BY-NC](https://creativecommons.org/licenses/by-nc/3.0/) licence, which provides free access for non-commercial and academic usage.
### Citation Information

BibTeX:

```bibtex
@inproceedings{soleimani-etal-2021-nlquad,
    title = "{NLQ}u{AD}: A Non-Factoid Long Question Answering Data Set",
    author = "Soleimani, Amir and Monz, Christof and Worring, Marcel",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.eacl-main.106",
    doi = "10.18653/v1/2021.eacl-main.106",
    pages = "1245--1255",
    abstract = "We introduce NLQuAD, the first data set with baseline methods for non-factoid long question answering, a task requiring document-level language understanding. In contrast to existing span detection question answering data sets, NLQuAD has non-factoid questions that are not answerable by a short span of text and demanding multiple-sentence descriptive answers and opinions. We show the limitation of the F1 score for evaluation of long answers and introduce Intersection over Union (IoU), which measures position-sensitive overlap between the predicted and the target answer spans. To establish baseline performances, we compare BERT, RoBERTa, and Longformer models. Experimental results and human evaluations show that Longformer outperforms the other architectures, but results are still far behind a human upper bound, leaving substantial room for improvements. NLQuAD{'}s samples exceed the input limitation of most pre-trained Transformer-based models, encouraging future research on long sequence language models.",
}
```
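The abstract quoted in the citation mentions Intersection over Union (IoU) as a position-sensitive overlap measure between predicted and target answer spans. A minimal sketch of that idea follows, assuming spans are given as character offsets with an exclusive end; the function name `span_iou` is hypothetical, and the paper's exact formulation may differ.

```python
from typing import Tuple

Span = Tuple[int, int]  # (start, end) character offsets; end assumed exclusive


def span_iou(pred: Span, target: Span) -> float:
    """Intersection over Union of two offset spans (illustrative sketch)."""
    p_start, p_end = pred
    t_start, t_end = target
    # Overlap length is zero when the spans do not intersect.
    intersection = max(0, min(p_end, t_end) - max(p_start, t_start))
    # Union length = sum of both lengths minus the shared part.
    union = (p_end - p_start) + (t_end - t_start) - intersection
    return intersection / union if union > 0 else 0.0


# Made-up spans for illustration only.
partial = span_iou((0, 10), (5, 15))
exact = span_iou((0, 10), (0, 10))
disjoint = span_iou((0, 5), (5, 10))
```

Unlike a bag-of-tokens F1, this measure depends on where the spans sit in the document, so a prediction that repeats the right words in the wrong place scores zero.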
Provider:
LLukas22
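Summary of original information

Dataset overview

Dataset name

  • Name: NLQuAD

Language

  • Language: English (en)

License

  • License: CC-BY-3.0

Dataset size

  • Size category: 10K<n<100K
  • Size of the generated dataset: 89.95 MB

Multilinguality

  • Multilinguality: monolingual

Task type

  • Task ID: extractive-qa

Dataset structure

Data instances

  • Example: each instance contains a title, a date, and a list of paragraphs; each paragraph contains a context and its question-answer pairs.

Data fields

  • title (string): article title.
  • date (string): article date.
  • paragraphs (list): contains the following fields:
    • context (string): paragraph content.
    • qas (list): question-answer pairs, each containing the following fields:
      • question (string): question text.
      • id (string): question ID.
      • answers (list): contains the following fields:
        • text (string): answer text.
        • answer_start (int64): start position of the answer in the context.
        • answer_end (int64): end position of the answer in the context.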
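The nested layout above (article, then paragraphs, then question-answer pairs, then answers) can be flattened into one row per answer. The sketch below is illustrative only: `QAExample`, `iter_qa_examples`, `make_answer`, and the toy record are hypothetical, and the toy offsets assume `answer_end` is exclusive, which the dataset card does not state.

```python
from typing import Iterator, NamedTuple


class QAExample(NamedTuple):
    """One flattened question/answer pair (hypothetical helper type)."""
    title: str
    question_id: str
    question: str
    answer_text: str
    answer_start: int
    answer_end: int


def iter_qa_examples(article: dict) -> Iterator[QAExample]:
    """Walk article -> paragraphs -> qas -> answers and yield flat rows."""
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                yield QAExample(
                    title=article["title"],
                    question_id=qa["id"],
                    question=qa["question"],
                    answer_text=answer["text"],
                    answer_start=answer["answer_start"],
                    answer_end=answer["answer_end"],
                )


def make_answer(context: str, text: str) -> dict:
    """Build an answer dict with offsets computed from the context.

    Assumption: answer_end is exclusive (start + length).
    """
    start = context.index(text)
    return {"text": text, "answer_start": start, "answer_end": start + len(text)}


# Made-up record in the documented shape (not taken from the dataset).
context = "Intro text. First answer body. Second answer body."
toy_article = {
    "title": "Toy article",
    "date": "1 January 2020",
    "paragraphs": [
        {
            "context": context,
            "qas": [
                {"question": "What comes first?", "id": "0_0",
                 "answers": [make_answer(context, "First answer body.")]},
                {"question": "What comes second?", "id": "0_1",
                 "answers": [make_answer(context, "Second answer body.")]},
            ],
        }
    ],
}

rows = list(iter_qa_examples(toy_article))
for row in rows:
    print(row.question_id, row.question, (row.answer_start, row.answer_end))
```

Before relying on the offsets in real records, it is worth checking whether slicing the context with `answer_start` and `answer_end` reproduces `text` exactly, since the end-offset convention is an assumption here.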

Data splits

  • Train: 10259 examples, 72036724 bytes.
  • Test: 1280 examples, 9045482 bytes.
  • Validation: 1280 examples, 8876137 bytes.
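The split figures above can be cross-checked with a little arithmetic. The sketch below only uses the numbers listed on this page; the variable names are illustrative.

```python
# Split sizes as listed above (examples and bytes per split).
splits = {
    "train": {"num_examples": 10259, "num_bytes": 72036724},
    "test": {"num_examples": 1280, "num_bytes": 9045482},
    "validation": {"num_examples": 1280, "num_bytes": 8876137},
}

total_examples = sum(info["num_examples"] for info in splits.values())
total_bytes = sum(info["num_bytes"] for info in splits.values())

# Fraction of all examples that falls into each split.
shares = {
    name: info["num_examples"] / total_examples
    for name, info in splits.items()
}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print("total examples:", total_examples)
print("total bytes:", total_bytes)
```

The byte total equals the `dataset_size` of 89958343 given in the card metadata, and the split is roughly 80/10/10. The example total of 12819 is close to the 13k articles mentioned in the summary, which suggests that each example is one article rather than one question.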

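Dataset details

  • Dataset description: NLQuAD is a non-factoid long question answering dataset built from BBC news articles. It contains 31k non-factoid questions with long answers, collected from 13k BBC news articles.
  • Dataset structure: news articles serve as context documents, interrogative sub-headings in the articles as questions, and the body paragraphs under each sub-heading as contiguous answers.
  • Dataset characteristics: target answers are extracted automatically, because annotating non-factoid long QA is extremely challenging and costly.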