
mkqa

ModelScope community · updated 2025-12-05 · indexed 2025-07-05
Download link:
https://modelscope.cn/datasets/apple/mkqa
Resource description:
# Dataset Card for MKQA: Multilingual Knowledge Questions & Answers

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- [**Homepage:**](https://github.com/apple/ml-mkqa/)
- [**Paper:**](https://arxiv.org/abs/2007.15207)

### Dataset Summary

MKQA contains 10,000 queries sampled from the [Google Natural Questions dataset](https://github.com/google-research-datasets/natural-questions).

For each query we collect new passage-independent answers. These queries and answers are then human-translated into 25 non-English languages.

### Supported Tasks and Leaderboards

`question-answering`

### Languages

| Language code | Language name |
|---------------|---------------|
| `ar` | `Arabic` |
| `da` | `Danish` |
| `de` | `German` |
| `en` | `English` |
| `es` | `Spanish` |
| `fi` | `Finnish` |
| `fr` | `French` |
| `he` | `Hebrew` |
| `hu` | `Hungarian` |
| `it` | `Italian` |
| `ja` | `Japanese` |
| `ko` | `Korean` |
| `km` | `Khmer` |
| `ms` | `Malay` |
| `nl` | `Dutch` |
| `no` | `Norwegian` |
| `pl` | `Polish` |
| `pt` | `Portuguese` |
| `ru` | `Russian` |
| `sv` | `Swedish` |
| `th` | `Thai` |
| `tr` | `Turkish` |
| `vi` | `Vietnamese` |
| `zh_cn` | `Chinese (Simplified)` |
| `zh_hk` | `Chinese (Hong Kong)` |
| `zh_tw` | `Chinese (Traditional)` |

## Dataset Structure

### Data Instances

An example from the dataset looks as follows:

```
{
  'example_id': 563260143484355911,
  'queries': {
    'en': "who sings i hear you knocking but you can't come in",
    'ru': "кто поет i hear you knocking but you can't come in",
    'ja': '「 I hear you knocking」は誰が歌っていますか',
    'zh_cn': "《i hear you knocking but you can't come in》是谁演唱的",
    ...
  },
  'query': "who sings i hear you knocking but you can't come in",
  'answers': {
    'en': [{'type': 'entity', 'entity': 'Q545186', 'text': 'Dave Edmunds', 'aliases': []}],
    'ru': [{'type': 'entity', 'entity': 'Q545186', 'text': 'Эдмундс, Дэйв', 'aliases': ['Эдмундс', 'Дэйв Эдмундс', 'Эдмундс Дэйв', 'Dave Edmunds']}],
    'ja': [{'type': 'entity', 'entity': 'Q545186', 'text': 'デイヴ・エドモンズ', 'aliases': ['デーブ・エドモンズ', 'デイブ・エドモンズ']}],
    'zh_cn': [{'type': 'entity', 'text': '戴维·埃德蒙兹 ', 'entity': 'Q545186'}],
    ...
  },
}
```

### Data Fields

Each example in the dataset contains the unique Natural Questions `example_id`, the original English `query`, and then `queries` and `answers` in 26 languages (English plus the 25 translated languages).

Each answer is labelled with an answer type.

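The record layout shown under Data Instances can be read with a few lines of Python. The following is a minimal sketch, assuming a record has already been loaded as a plain `dict` shaped like that example. The helper name `accepted_answers` is hypothetical and not part of any MKQA tooling. Because the `zh_cn` entry in the example carries no `aliases` key, the sketch treats a missing key the same as an empty list.

```python
# One record, abridged from the Data Instances example in this card.
record = {
    "example_id": 563260143484355911,
    "query": "who sings i hear you knocking but you can't come in",
    "queries": {
        "en": "who sings i hear you knocking but you can't come in",
        "ru": "кто поет i hear you knocking but you can't come in",
    },
    "answers": {
        "en": [{"type": "entity", "entity": "Q545186",
                "text": "Dave Edmunds", "aliases": []}],
        "ru": [{"type": "entity", "entity": "Q545186",
                "text": "Эдмундс, Дэйв",
                "aliases": ["Эдмундс", "Дэйв Эдмундс",
                            "Эдмундс Дэйв", "Dave Edmunds"]}],
        "zh_cn": [{"type": "entity", "text": "戴维·埃德蒙兹 ",
                   "entity": "Q545186"}],
    },
}


def accepted_answers(rec, lang):
    """Collect every acceptable answer string for one language.

    Hypothetical helper: returns the primary `text` followed by its
    `aliases`, stripped of surrounding whitespace and de-duplicated
    while preserving order.
    """
    strings = []
    for answer in rec["answers"].get(lang, []):
        # `text` may be missing or None; `aliases` may be missing or empty.
        candidates = [answer.get("text") or ""]
        candidates.extend(answer.get("aliases") or [])
        for candidate in candidates:
            candidate = candidate.strip()
            if candidate and candidate not in strings:
                strings.append(candidate)
    return strings


english = accepted_answers(record, "en")
russian = accepted_answers(record, "ru")
chinese = accepted_answers(record, "zh_cn")
```

Stripping whitespace is a choice made for this sketch (the `zh_cn` text in the example ends with a space). An evaluation script may prefer its own normalization.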
The breakdown of answer types is:

| Answer Type | Occurrence |
|---------------|---------------|
| `entity` | `4221` |
| `long_answer` | `1815` |
| `unanswerable` | `1427` |
| `date` | `1174` |
| `number` | `485` |
| `number_with_unit` | `394` |
| `short_phrase` | `346` |
| `binary` | `138` |

For each language, there can be more than one acceptable textual answer, in order to capture a variety of possible valid answers.

The detailed explanation of the fields is taken from [here](https://github.com/apple/ml-mkqa/#dataset).

When the `entity` field is not available, it is set to an empty string `''`. When the `aliases` field is not available, it is set to an empty list `[]`.

### Data Splits

- Train: 10000

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[Google Natural Questions dataset](https://github.com/google-research-datasets/natural-questions)

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[CC BY-SA 3.0](https://github.com/apple/ml-mkqa#license)

### Citation Information

```
@misc{mkqa,
    title = {MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering},
    author = {Shayne Longpre and Yi Lu and Joachim Daiber},
    year = {2020},
    URL = {https://arxiv.org/pdf/2007.15207.pdf}
}
```

### Contributions

Thanks to [@cceyda](https://github.com/cceyda) for adding this dataset.
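As a quick sanity check on the answer-type table under Data Fields, the counts can be summed and turned into shares. This is a small sketch with the numbers copied from that table. The names `ANSWER_TYPE_COUNTS` and `answer_type_shares` are illustrative, and reading each count as "one query per type" is an assumption, although the total does match the 10,000 queries stated in the summary.

```python
# Counts copied from the answer-type table in the Data Fields section.
ANSWER_TYPE_COUNTS = {
    "entity": 4221,
    "long_answer": 1815,
    "unanswerable": 1427,
    "date": 1174,
    "number": 485,
    "number_with_unit": 394,
    "short_phrase": 346,
    "binary": 138,
}


def answer_type_shares(counts):
    """Return each answer type's fraction of the total count."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


total = sum(ANSWER_TYPE_COUNTS.values())
shares = answer_type_shares(ANSWER_TYPE_COUNTS)

# Print the table as percentages, largest category first.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:<18}{share:6.1%}")
```

The total comes to 10,000, so `entity` answers account for 42.21% of the table and `unanswerable` for 14.27%.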

Provider:
maas
Created:
2025-07-04