google-research-datasets/natural_questions
Dataset Card for Natural Questions
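Dataset Description
Dataset Summary
The Natural Questions dataset contains questions from real users and requires question answering systems to read and comprehend an entire Wikipedia article that may or may not contain the answer. Because the questions come from real users and a solution must read a whole page to find the answer, Natural Questions is more realistic and more challenging than earlier question answering datasets.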
Supported Tasks and Leaderboards
- Task category: question answering
- Task ID: open-domain question answering
Languages
- English (en)
Dataset Structure
Data Instances
An example from the training set looks as follows:
    {
      "id": "797803103760793766",
      "document": {
        "title": "Google",
        "url": "http://www.wikipedia.org/Google",
        "html": "<html><body><h1>Google Inc.</h1><p>Google was founded in 1998 By:<ul><li>Larry</li><li>Sergey</li></ul></p></body></html>",
        "tokens": [
          {"token": "<h1>", "start_byte": 12, "end_byte": 16, "is_html": true},
          {"token": "Google", "start_byte": 16, "end_byte": 22, "is_html": false},
          {"token": "inc", "start_byte": 23, "end_byte": 26, "is_html": false},
          {"token": ".", "start_byte": 26, "end_byte": 27, "is_html": false},
          {"token": "</h1>", "start_byte": 27, "end_byte": 32, "is_html": true},
          {"token": "<p>", "start_byte": 32, "end_byte": 35, "is_html": true},
          {"token": "Google", "start_byte": 35, "end_byte": 41, "is_html": false},
          {"token": "was", "start_byte": 42, "end_byte": 45, "is_html": false},
          {"token": "founded", "start_byte": 46, "end_byte": 53, "is_html": false},
          {"token": "in", "start_byte": 54, "end_byte": 56, "is_html": false},
          {"token": "1998", "start_byte": 57, "end_byte": 61, "is_html": false},
          {"token": "by", "start_byte": 62, "end_byte": 64, "is_html": false},
          {"token": ":", "start_byte": 64, "end_byte": 65, "is_html": false},
          {"token": "<ul>", "start_byte": 65, "end_byte": 69, "is_html": true},
          {"token": "<li>", "start_byte": 69, "end_byte": 73, "is_html": true},
          {"token": "Larry", "start_byte": 73, "end_byte": 78, "is_html": false},
          {"token": "</li>", "start_byte": 78, "end_byte": 83, "is_html": true},
          {"token": "<li>", "start_byte": 83, "end_byte": 87, "is_html": true},
          {"token": "Sergey", "start_byte": 87, "end_byte": 92, "is_html": false},
          {"token": "</li>", "start_byte": 92, "end_byte": 97, "is_html": true},
          {"token": "</ul>", "start_byte": 97, "end_byte": 102, "is_html": true},
          {"token": "</p>", "start_byte": 102, "end_byte": 106, "is_html": true}
        ]
      },
      "question": {
        "text": "who founded google",
        "tokens": ["who", "founded", "google"]
      },
      "long_answer_candidates": [
        {"start_byte": 32, "end_byte": 106, "start_token": 5, "end_token": 22, "top_level": true},
        {"start_byte": 65, "end_byte": 102, "start_token": 13, "end_token": 21, "top_level": false},
        {"start_byte": 69, "end_byte": 83, "start_token": 14, "end_token": 17, "top_level": false},
        {"start_byte": 83, "end_byte": 92, "start_token": 17, "end_token": 20, "top_level": false}
      ],
      "annotations": [{
        "id": "6782080525527814293",
        "long_answer": {"start_byte": 32, "end_byte": 106, "start_token": 5, "end_token": 22, "candidate_index": 0},
        "short_answers": [
          {"start_byte": 73, "end_byte": 78, "start_token": 15, "end_token": 16, "text": "Larry"},
          {"start_byte": 87, "end_byte": 92, "start_token": 18, "end_token": 19, "text": "Sergey"}
        ],
        "yes_no_answer": -1
      }]
    }
Data Fields
default
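- id: a string feature.
- document: a dictionary feature containing:
  - title: a string feature.
  - url: a string feature.
  - html: a string feature.
  - tokens: a dictionary feature containing:
    - token: a string feature.
    - is_html: a bool feature.
    - start_byte: an int64 feature.
    - end_byte: an int64 feature.
- question: a dictionary feature containing:
  - text: a string feature.
  - tokens: a list of string features.
- long_answer_candidates: a dictionary feature containing:
  - start_token: an int64 feature.
  - end_token: an int64 feature.
  - start_byte: an int64 feature.
  - end_byte: an int64 feature.
  - top_level: a bool feature.
- annotations: a dictionary feature containing:
  - id: a string feature.
  - long_answer: a dictionary feature containing:
    - start_token: an int64 feature.
    - end_token: an int64 feature.
    - start_byte: an int64 feature.
    - end_byte: an int64 feature.
    - candidate_index: an int64 feature.
  - short_answers: a dictionary feature containing:
    - start_token: an int64 feature.
    - end_token: an int64 feature.
    - start_byte: an int64 feature.
    - end_byte: an int64 feature.
    - text: a string feature.
  - yes_no_answer: a classification label with possible values NO (0) and YES (1).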
Data Splits
| Name | Train | Validation |
|---|---|---|
| default | 307373 | 7830 |
| dev | N/A | 7830 |
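A split from the table can be read with the Hugging Face `datasets` library. The sketch below is hedged: it assumes the `datasets` package is installed, that the repository id at the top of this card is the correct path, and that the configuration names match the table (`default`, `dev`). The helper name `stream_split` and the `EXPECTED_SIZES` mapping are hypothetical. Streaming is used so that nothing is downloaded until the split is iterated.

```python
# Sketch only: stream one split rather than downloading the whole dataset.
# Assumes the `datasets` package is installed and that the repo id and config
# names below match this card.

DATASET_ID = "google-research-datasets/natural_questions"

# Example counts from the splits table, keyed by (config, split).
EXPECTED_SIZES = {
    ("default", "train"): 307373,
    ("default", "validation"): 7830,
    ("dev", "validation"): 7830,
}


def stream_split(config="default", split="validation"):
    """Hypothetical helper: return an iterable over one split of the dataset."""
    if (config, split) not in EXPECTED_SIZES:
        raise ValueError(f"unknown config/split: {config}/{split}")
    # Imported lazily so the module can be inspected without the dependency.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, config, split=split, streaming=True)


# Usage (requires network access):
#   first = next(iter(stream_split("dev", "validation")))
#   print(first["question"]["text"])
```

The `dev` configuration has no training split, so requesting `("dev", "train")` raises a `ValueError` before any network call is made.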
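Dataset Creation
Curation Rationale
Source Data
Initial Data Collection and Normalization
Who are the source language producers?
Annotations
Annotation process
Who are the annotators?
Personal and Sensitive Information
Considerations for Using the Data
Social Impact of Dataset
Discussion of Biases
Other Known Limitations
Additional Information
Dataset Curators
Licensing Information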
Creative Commons Attribution-ShareAlike 3.0 Unported
Citation Information
    @article{47761,
      title = {Natural Questions: a Benchmark for Question Answering Research},
      author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
      year = {2019},
      journal = {Transactions of the Association for Computational Linguistics}
    }
Contributions




