mkqa
Source: ModelScope community. Updated 2025-12-05; indexed 2025-07-05.
Download link: https://modelscope.cn/datasets/apple/mkqa
# Dataset Card for MKQA: Multilingual Knowledge Questions & Answers
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- [**Homepage:**](https://github.com/apple/ml-mkqa/)
- [**Paper:**](https://arxiv.org/abs/2007.15207)
### Dataset Summary
MKQA contains 10,000 queries sampled from the [Google Natural Questions dataset](https://github.com/google-research-datasets/natural-questions).
For each query we collect new passage-independent answers.
These queries and answers are then human-translated into 25 non-English languages.
### Supported Tasks and Leaderboards
`question-answering`
### Languages
| Language code | Language name |
|---------------|---------------|
| `ar` | `Arabic` |
| `da` | `Danish` |
| `de` | `German` |
| `en` | `English` |
| `es` | `Spanish` |
| `fi` | `Finnish` |
| `fr` | `French` |
| `he` | `Hebrew` |
| `hu` | `Hungarian` |
| `it` | `Italian` |
| `ja` | `Japanese` |
| `ko` | `Korean` |
| `km` | `Khmer` |
| `ms` | `Malay` |
| `nl` | `Dutch` |
| `no` | `Norwegian` |
| `pl` | `Polish` |
| `pt` | `Portuguese` |
| `ru` | `Russian` |
| `sv` | `Swedish` |
| `th` | `Thai` |
| `tr` | `Turkish` |
| `vi` | `Vietnamese` |
| `zh_cn` | `Chinese (Simplified)` |
| `zh_hk` | `Chinese (Hong Kong)` |
| `zh_tw` | `Chinese (Traditional)` |
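The codes in the table can be kept in a small lookup to validate language arguments before indexing into a record. The sketch below is illustrative only: `MKQA_LANGUAGES`, `non_english_codes`, and `language_name` are hypothetical names, not part of the dataset or of any library.

```python
# Language codes and names copied from the table above.
MKQA_LANGUAGES = {
    "ar": "Arabic", "da": "Danish", "de": "German", "en": "English",
    "es": "Spanish", "fi": "Finnish", "fr": "French", "he": "Hebrew",
    "hu": "Hungarian", "it": "Italian", "ja": "Japanese", "ko": "Korean",
    "km": "Khmer", "ms": "Malay", "nl": "Dutch", "no": "Norwegian",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "sv": "Swedish",
    "th": "Thai", "tr": "Turkish", "vi": "Vietnamese",
    "zh_cn": "Chinese (Simplified)", "zh_hk": "Chinese (Hong Kong)",
    "zh_tw": "Chinese (Traditional)",
}


def non_english_codes(languages=MKQA_LANGUAGES):
    """Return the codes of every language except English, sorted."""
    return sorted(code for code in languages if code != "en")


def language_name(code, languages=MKQA_LANGUAGES):
    """Look up a language name, raising ValueError for unknown codes."""
    try:
        return languages[code]
    except KeyError:
        raise ValueError(f"unsupported MKQA language code: {code!r}") from None
```

The table lists 26 codes in total: English plus the 25 non-English translation languages.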
## Dataset Structure
### Data Instances
An example from the dataset looks as follows:
```
{
    'example_id': 563260143484355911,
    'queries': {
        'en': "who sings i hear you knocking but you can't come in",
        'ru': "кто поет i hear you knocking but you can't come in",
        'ja': '「 I hear you knocking」は誰が歌っていますか',
        'zh_cn': "《i hear you knocking but you can't come in》是谁演唱的",
        ...
    },
    'query': "who sings i hear you knocking but you can't come in",
    'answers': {
        'en': [{'type': 'entity',
                'entity': 'Q545186',
                'text': 'Dave Edmunds',
                'aliases': []}],
        'ru': [{'type': 'entity',
                'entity': 'Q545186',
                'text': 'Эдмундс, Дэйв',
                'aliases': ['Эдмундс', 'Дэйв Эдмундс', 'Эдмундс Дэйв', 'Dave Edmunds']}],
        'ja': [{'type': 'entity',
                'entity': 'Q545186',
                'text': 'デイヴ・エドモンズ',
                'aliases': ['デーブ・エドモンズ', 'デイブ・エドモンズ']}],
        'zh_cn': [{'type': 'entity', 'text': '戴维·埃德蒙兹 ', 'entity': 'Q545186'}],
        ...
    },
}
```
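To make the nesting concrete, the sketch below walks an abbreviated copy of the record above (two languages only) and pairs each localized query with its first answer. The helper `summarize` is a hypothetical name used for illustration, and the record is assumed to be a plain Python dictionary.

```python
# Abbreviated copy of the example record above (only two languages kept).
example = {
    "example_id": 563260143484355911,
    "queries": {
        "en": "who sings i hear you knocking but you can't come in",
        "ja": "「 I hear you knocking」は誰が歌っていますか",
    },
    "query": "who sings i hear you knocking but you can't come in",
    "answers": {
        "en": [{"type": "entity", "entity": "Q545186",
                "text": "Dave Edmunds", "aliases": []}],
        "ja": [{"type": "entity", "entity": "Q545186",
                "text": "デイヴ・エドモンズ",
                "aliases": ["デーブ・エドモンズ", "デイブ・エドモンズ"]}],
    },
}


def summarize(record):
    """Return {language: (query, first answer text, entity id)}."""
    summary = {}
    for lang, query in record["queries"].items():
        first = record["answers"][lang][0]
        summary[lang] = (query, first["text"], first.get("entity", ""))
    return summary


summary = summarize(example)

# The answer text is localized, but the entity identifier is shared.
entity_ids = {entity for _, _, entity in summary.values()}
```

In this record the `text` differs by language while `entity` stays the same, so the identifier can serve as a language-independent key for the answer.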
### Data Fields
Each example in the dataset contains the unique Natural Questions `example_id`, the original English `query`, and then `queries` and `answers` in 26 languages.
Each answer is labelled with an answer type. The breakdown is:
| Answer Type | Occurrence |
|---------------|---------------|
| `entity` | `4221` |
| `long_answer` | `1815` |
| `unanswerable` | `1427` |
| `date` | `1174` |
| `number` | `485` |
| `number_with_unit` | `394` |
| `short_phrase` | `346` |
| `binary` | `138` |
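As a quick sanity check on the table, the counts can be summed and converted to percentage shares. This is only arithmetic on the numbers listed above; `ANSWER_TYPE_COUNTS` and `shares` are illustrative names.

```python
# Occurrence counts copied from the answer-type table.
ANSWER_TYPE_COUNTS = {
    "entity": 4221, "long_answer": 1815, "unanswerable": 1427, "date": 1174,
    "number": 485, "number_with_unit": 394, "short_phrase": 346, "binary": 138,
}

total = sum(ANSWER_TYPE_COUNTS.values())

# Percentage share of each answer type, rounded to one decimal place.
shares = {name: round(100 * count / total, 1)
          for name, count in ANSWER_TYPE_COUNTS.items()}

for name, share in shares.items():
    print(f"{name:<17}{ANSWER_TYPE_COUNTS[name]:>6}{share:>7.1f}%")
```

The counts sum to 10,000, which matches the number of queries in the dataset; `entity` answers account for roughly 42% of them.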
For each language, there can be more than one acceptable textual answer, in order to capture a variety of possible valid answers.
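Because several strings can be acceptable, a prediction is naturally compared against the answer `text` and every entry in `aliases`. The sketch below shows one simplified way to do that. It is not the official MKQA evaluation script; the normalization (lowercasing and whitespace collapsing) is an assumption chosen for brevity, and all function names are hypothetical.

```python
def accepted_strings(answers):
    """Collect every answer text and alias, skipping missing values."""
    strings = []
    for answer in answers:
        if answer.get("text"):
            strings.append(answer["text"])
        strings.extend(answer.get("aliases") or [])
    return strings


def normalize(text):
    """Lowercase and collapse whitespace (a deliberately simple normalizer)."""
    return " ".join(text.lower().split())


def is_match(prediction, answers):
    """True if the prediction equals any accepted string after normalizing."""
    target = normalize(prediction)
    return any(normalize(s) == target for s in accepted_strings(answers))


# Russian answer taken from the example record in "Data Instances".
ru_answers = [{"type": "entity", "entity": "Q545186",
               "text": "Эдмундс, Дэйв",
               "aliases": ["Эдмундс", "Дэйв Эдмундс", "Эдмундс Дэйв",
                           "Dave Edmunds"]}]

matched_alias = is_match("Дэйв Эдмундс", ru_answers)
matched_latin = is_match("dave  edmunds", ru_answers)
```

Both calls succeed because the prediction only has to equal one of the accepted strings, not the primary `text`.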
The field descriptions are taken from the [MKQA repository](https://github.com/apple/ml-mkqa/#dataset).
- When the `entity` field is not available, it is set to an empty string `''`.
- When the `aliases` field is not available, it is set to an empty list `[]`.
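When working with answer records whose optional keys may be absent (the `zh_cn` answer in the example above omits `aliases`, for instance), the same defaults can be applied defensively. This is a minimal sketch under that assumption; `with_defaults` and `has_entity` are hypothetical helpers, and the second record below is made up for illustration.

```python
def with_defaults(answer):
    """Return a copy of an answer with the documented defaults filled in."""
    filled = dict(answer)
    filled.setdefault("entity", "")   # missing entity -> empty string
    filled.setdefault("aliases", [])  # missing aliases -> empty list
    return filled


def has_entity(answer):
    """True when the answer is linked to an entity identifier."""
    return bool(with_defaults(answer)["entity"])


# zh_cn answer from the example record; it carries no 'aliases' key.
zh_answer = {"type": "entity", "text": "戴维·埃德蒙兹 ", "entity": "Q545186"}

# Hypothetical record with neither 'entity' nor 'aliases'.
plain_answer = {"type": "short_phrase", "text": "an illustrative phrase"}

zh_filled = with_defaults(zh_answer)
plain_filled = with_defaults(plain_answer)
```

Copying the record before filling it keeps the original data untouched, so the defaults never leak back into the source.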
### Data Splits
- Train: 10000
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
[Google Natural Questions dataset](https://github.com/google-research-datasets/natural-questions)
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[CC BY-SA 3.0](https://github.com/apple/ml-mkqa#license)
### Citation Information
```
@misc{mkqa,
title = {MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering},
author = {Shayne Longpre and Yi Lu and Joachim Daiber},
year = {2020},
URL = {https://arxiv.org/pdf/2007.15207.pdf}
}
```
### Contributions
Thanks to [@cceyda](https://github.com/cceyda) for adding this dataset.
Provider: maas
Created: 2025-07-04