LLukas22/NLQuAD
Hugging Face | Updated: 2022-12-23 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/LLukas22/NLQuAD
Resource description:
---
pretty_name: NLQuAD
language:
- en
license:
- cc-by-3.0
size_categories:
- 10K<n<100K
multilinguality:
- monolingual
task_ids:
- extractive-qa
dataset_info:
  features:
  - name: title
    dtype: string
  - name: date
    dtype: string
  - name: paragraphs
    list:
    - name: context
      dtype: string
    - name: qas
      list:
      - name: answers
        list:
        - name: answer_end
          dtype: int64
        - name: answer_start
          dtype: int64
        - name: text
          dtype: string
      - name: id
        dtype: string
      - name: question
        dtype: string
  splits:
  - name: train
    num_bytes: 72036724
    num_examples: 10259
  - name: test
    num_bytes: 9045482
    num_examples: 1280
  - name: validation
    num_bytes: 8876137
    num_examples: 1280
  download_size: 0
  dataset_size: 89958343
---
# Dataset Card for "NLQuAD"
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Additional Information](#additional-information)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
## Dataset Description
- **Homepage:** [https://github.com/ASoleimaniB/NLQuAD](https://github.com/ASoleimaniB/NLQuAD)
- **Paper:** [https://aclanthology.org/2021.eacl-main.106/](https://aclanthology.org/2021.eacl-main.106/)
- **Size of the generated dataset:** 89.95 MB
### Dataset Summary
This is a copy of the original NLQuAD dataset distributed via [GitHub](https://github.com/ASoleimaniB/NLQuAD).
NLQuAD is a non-factoid long question answering dataset from BBC news articles.
NLQuAD's question types, together with the length of its context documents and answers, make it a challenging real-world task.
NLQuAD consists of news articles as context documents, interrogative sub-headings in the articles as questions, and body paragraphs corresponding to the sub-headings as contiguous answers to the questions.
NLQuAD contains 31k non-factoid questions and long answers collected from 13k BBC news articles.
See example articles in BBC [1](https://www.bbc.com/news/world-asia-china-51230011), [2](https://www.bbc.com/news/world-55709428).
We automatically extract target answers because annotating for non-factoid long QA is extremely challenging and costly.
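As a minimal sketch, this copy can presumably be loaded with the Hugging Face `datasets` library. The repository id `LLukas22/NLQuAD` and the split names are taken from this card; the helper name `load_nlquad` is hypothetical, and the call needs the `datasets` package and network access.

```python
def load_nlquad(split: str = "train"):
    """Load one split ("train", "test" or "validation") of the Hub copy.

    Assumes `pip install datasets` and network access on first use.
    """
    # Imported lazily so that defining the helper does not require the package.
    from datasets import load_dataset

    return load_dataset("LLukas22/NLQuAD", split=split)


# Example usage (downloads the data on the first call):
# train = load_nlquad("train")
# print(train[0]["title"])
```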
## Dataset Structure
### Data Instances
An example from the `train` split looks as follows.
```json
{
  "title": "Khashoggi murder: Body 'dissolved in acid'",
  "date": "2 November 2018",
  "paragraphs": [
    {
      "context": "A top Turkish official, presidential adviser Yasin Aktay, has said ....",
      "qas": [
        {
          "question": "What was said in the crown prince's alleged phone call?",
          "id": "0_0",
          "answers": [
            {
              "text": "During the call with President Donald Trump's son-in-law Jared Kushner and national ....",
              "answer_start": 1352,
              "answer_end": 2108
            }
          ]
        },
        {
          "question": "What has the investigation found so far?",
          "id": "0_1",
          "answers": [
            {
              "text": "There is still no consensus on how Khashoggi died. He entered ....",
              "answer_start": 2109,
              "answer_end": 3128
            }
          ]
        }
      ]
    }
  ]
}
```
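Because each record nests questions and answers inside `paragraphs`, it is often convenient to flatten an article into one row per question-answer pair. The following is a sketch only: the function name `iter_qa_pairs`, the output keys, and the toy record are illustrative assumptions, not part of the dataset.

```python
from typing import Any, Dict, Iterator


def iter_qa_pairs(article: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per (question, answer) pair in an article record."""
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                yield {
                    "id": qa["id"],
                    "title": article["title"],
                    "date": article["date"],
                    "question": qa["question"],
                    "context": context,
                    "answer_text": answer["text"],
                    "answer_start": answer["answer_start"],
                    "answer_end": answer["answer_end"],
                }


# Toy record with the same nesting as the example above (contents are made up).
toy_article = {
    "title": "Toy title",
    "date": "1 January 2020",
    "paragraphs": [
        {
            "context": "Intro. First answer. Second answer.",
            "qas": [
                {
                    "question": "What is first?",
                    "id": "0_0",
                    "answers": [
                        {"text": "First answer.", "answer_start": 7, "answer_end": 20}
                    ],
                },
                {
                    "question": "What is second?",
                    "id": "0_1",
                    "answers": [
                        {"text": "Second answer.", "answer_start": 21, "answer_end": 35}
                    ],
                },
            ],
        }
    ],
}

rows = list(iter_qa_pairs(toy_article))
```

Each row repeats the full context, which costs memory for long articles but matches the input shape most extractive QA pipelines expect.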
### Data Fields
The data fields are the same among all splits.
- `title`: a `string` feature.
- `date`: a `string` feature.
- `paragraphs`: a list feature containing dictionaries:
  - `context`: a `string` feature.
  - `qas`: a list feature containing dictionaries:
    - `question`: a `string` feature.
    - `id`: a `string` feature.
    - `answers`: a list feature containing dictionaries:
      - `text`: a `string` feature.
      - `answer_start`: an `int64` feature.
      - `answer_end`: an `int64` feature.
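The card does not say whether `answer_end` is an inclusive or exclusive character offset into `context`. A small check such as the sketch below can test both readings on real records; the helper name `end_convention` and the assumption that offsets count characters are illustrative, not confirmed by the source.

```python
from typing import Any, Dict, Optional


def end_convention(context: str, answer: Dict[str, Any]) -> Optional[str]:
    """Report which reading of `answer_end` reproduces `text` from `context`.

    Returns "exclusive" if context[start:end] equals the answer text,
    "inclusive" if context[start:end + 1] does, and None if neither matches.
    """
    start, end, text = answer["answer_start"], answer["answer_end"], answer["text"]
    if context[start:end] == text:
        return "exclusive"
    if context[start:end + 1] == text:
        return "inclusive"
    return None


# Toy data (made up) exercising all three outcomes.
toy_context = "Intro. Long answer here. Tail."
toy_text = "Long answer here."
toy_start = toy_context.index(toy_text)

exclusive_case = {"text": toy_text, "answer_start": toy_start,
                  "answer_end": toy_start + len(toy_text)}
inclusive_case = {"text": toy_text, "answer_start": toy_start,
                  "answer_end": toy_start + len(toy_text) - 1}
mismatch_case = {"text": "Not in context", "answer_start": 0, "answer_end": 5}
```

Running the check over a sample of records before training helps avoid off-by-one errors in span labels.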
### Data Splits
| split    | train | test | validation |
|----------|------:|-----:|-----------:|
| examples | 10259 | 1280 |       1280 |
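As a quick arithmetic check, the split sizes above correspond to roughly an 80/10/10 partition. The snippet uses only the counts from the table.

```python
# Example counts per split, copied from the table above.
split_sizes = {"train": 10259, "test": 1280, "validation": 1280}

total = sum(split_sizes.values())
fractions = {name: count / total for name, count in split_sizes.items()}

for name, fraction in fractions.items():
    print(f"{name}: {split_sizes[name]} examples ({fraction:.1%})")
```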
## Additional Information
### Licensing Information
This dataset is distributed under the [CC BY-NC 3.0](https://creativecommons.org/licenses/by-nc/3.0/) licence, which permits free non-commercial and academic use.
### Citation Information
BibTeX:
```bibtex
@inproceedings{soleimani-etal-2021-nlquad,
title = "{NLQ}u{AD}: A Non-Factoid Long Question Answering Data Set",
author = "Soleimani, Amir and
Monz, Christof and
Worring, Marcel",
booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
month = apr,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.eacl-main.106",
doi = "10.18653/v1/2021.eacl-main.106",
pages = "1245--1255",
abstract = "We introduce NLQuAD, the first data set with baseline methods for non-factoid long question answering, a task requiring document-level language understanding. In contrast to existing span detection question answering data sets, NLQuAD has non-factoid questions that are not answerable by a short span of text and demanding multiple-sentence descriptive answers and opinions. We show the limitation of the F1 score for evaluation of long answers and introduce Intersection over Union (IoU), which measures position-sensitive overlap between the predicted and the target answer spans. To establish baseline performances, we compare BERT, RoBERTa, and Longformer models. Experimental results and human evaluations show that Longformer outperforms the other architectures, but results are still far behind a human upper bound, leaving substantial room for improvements. NLQuAD{'}s samples exceed the input limitation of most pre-trained Transformer-based models, encouraging future research on long sequence language models.",
}
```
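The abstract above mentions Intersection over Union (IoU) as a position-sensitive overlap measure between predicted and target answer spans. A minimal sketch of span IoU follows, assuming half-open character spans `[start, end)`; the function name `span_iou` is hypothetical and the paper's exact formulation may differ in detail.

```python
def span_iou(pred_start: int, pred_end: int, gold_start: int, gold_end: int) -> float:
    """Intersection over Union of two half-open spans [start, end)."""
    # Overlap length is zero when the spans are disjoint.
    intersection = max(0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - intersection
    return intersection / union if union > 0 else 0.0


# Identical spans overlap fully; adjacent spans do not overlap at all.
full_overlap = span_iou(0, 10, 0, 10)
no_overlap = span_iou(0, 5, 5, 10)
partial_overlap = span_iou(0, 10, 5, 15)
```

Unlike token-level F1, this score depends on where the predicted span sits in the document, so a prediction with the right words in the wrong place scores zero.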
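Provider:
LLukas22
Summary of original information
Dataset overview
Dataset name
- Name: NLQuAD
Language
- Language: English (en)
License
- License: CC-BY-3.0
Dataset size
- Size category: 10K<n<100K
- Size of the generated dataset: 89.95 MB
Multilinguality
- Multilinguality: monolingual
Task type
- Task ID: extractive-qa
Dataset structure
Data instances
- Example: contains a title, a date, and paragraphs; each paragraph contains a context and question-answer pairs.
Data fields
- title (string): article title.
- date (string): article date.
- paragraphs (list): contains the following fields:
  - context (string): paragraph text.
  - qas (list): question-answer pairs, containing the following fields:
    - question (string): question text.
    - id (string): question ID.
    - answers (list): contains the following fields:
      - text (string): answer text.
      - answer_start (int64): start position of the answer in the text.
      - answer_end (int64): end position of the answer in the text.
Data splits
- Training set: 10259 examples, 72036724 bytes.
- Test set: 1280 examples, 9045482 bytes.
- Validation set: 1280 examples, 8876137 bytes.
Dataset details
- Dataset description: NLQuAD is a non-factoid long question answering dataset drawn from BBC news articles, containing 31k non-factoid questions and long answers collected from 13k BBC news articles.
- Dataset structure: the dataset uses news articles as context documents, interrogative sub-headings in the articles as questions, and the body paragraphs corresponding to those sub-headings as contiguous answers.
- Dataset characteristics: target answers are extracted automatically, because annotating non-factoid long QA is extremely challenging and costly.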



