mkqa

Name: mkqa
Creator: maas
Published: 2025-12-05 16:40:31
License: No description provided

ModelScope Community; updated 2025-12-05; indexed 2025-07-05

Download link:

https://modelscope.cn/datasets/apple/mkqa


Resource description:

# Dataset Card for MKQA: Multilingual Knowledge Questions & Answers

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- [**Homepage:**](https://github.com/apple/ml-mkqa/)
- [**Paper:**](https://arxiv.org/abs/2007.15207)

### Dataset Summary

MKQA contains 10,000 queries sampled from the [Google Natural Questions dataset](https://github.com/google-research-datasets/natural-questions). For each query we collect new passage-independent answers. These queries and answers are then human-translated into 25 non-English languages.

### Supported Tasks and Leaderboards

`question-answering`

### Languages

| Language code | Language name |
|---------------|---------------|
| `ar` | `Arabic` |
| `da` | `Danish` |
| `de` | `German` |
| `en` | `English` |
| `es` | `Spanish` |
| `fi` | `Finnish` |
| `fr` | `French` |
| `he` | `Hebrew` |
| `hu` | `Hungarian` |
| `it` | `Italian` |
| `ja` | `Japanese` |
| `ko` | `Korean` |
| `km` | `Khmer` |
| `ms` | `Malay` |
| `nl` | `Dutch` |
| `no` | `Norwegian` |
| `pl` | `Polish` |
| `pt` | `Portuguese` |
| `ru` | `Russian` |
| `sv` | `Swedish` |
| `th` | `Thai` |
| `tr` | `Turkish` |
| `vi` | `Vietnamese` |
| `zh_cn` | `Chinese (Simplified)` |
| `zh_hk` | `Chinese (Hong Kong)` |
| `zh_tw` | `Chinese (Traditional)` |

## Dataset Structure

### Data Instances

An example from the dataset looks as follows:

```
{
  'example_id': 563260143484355911,
  'queries': {
    'en': "who sings i hear you knocking but you can't come in",
    'ru': "кто поет i hear you knocking but you can't come in",
    'ja': '「 I hear you knocking」は誰が歌っていますか',
    'zh_cn': "《i hear you knocking but you can't come in》是谁演唱的",
    ...
  },
  'query': "who sings i hear you knocking but you can't come in",
  'answers': {
    'en': [{'type': 'entity', 'entity': 'Q545186', 'text': 'Dave Edmunds', 'aliases': []}],
    'ru': [{'type': 'entity', 'entity': 'Q545186', 'text': 'Эдмундс, Дэйв', 'aliases': ['Эдмундс', 'Дэйв Эдмундс', 'Эдмундс Дэйв', 'Dave Edmunds']}],
    'ja': [{'type': 'entity', 'entity': 'Q545186', 'text': 'デイヴ・エドモンズ', 'aliases': ['デーブ・エドモンズ', 'デイブ・エドモンズ']}],
    'zh_cn': [{'type': 'entity', 'text': '戴维·埃德蒙兹 ', 'entity': 'Q545186'}],
    ...
  },
}
```

### Data Fields

Each example in the dataset contains the unique Natural Questions `example_id`, the original English `query`, and then `queries` and `answers` in 26 languages. Each answer is labelled with an answer type.
The breakdown is:

| Answer Type | Occurrence |
|---------------|---------------|
| `entity` | `4221` |
| `long_answer` | `1815` |
| `unanswerable` | `1427` |
| `date` | `1174` |
| `number` | `485` |
| `number_with_unit` | `394` |
| `short_phrase` | `346` |
| `binary` | `138` |

For each language, there can be more than one acceptable textual answer, in order to capture a variety of possible valid answers.

Detailed field descriptions are taken from [the ml-mkqa README](https://github.com/apple/ml-mkqa/#dataset).

When the `entity` field is not available, it is set to an empty string `''`.
When the `aliases` field is not available, it is set to an empty list `[]`.

### Data Splits

- Train: 10000

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[Google Natural Questions dataset](https://github.com/google-research-datasets/natural-questions)

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[CC BY-SA 3.0](https://github.com/apple/ml-mkqa#license)

### Citation Information

```
@misc{mkqa,
    title = {MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering},
    author = {Shayne Longpre and Yi Lu and Joachim Daiber},
    year = {2020},
    URL = {https://arxiv.org/pdf/2007.15207.pdf}
}
```

### Contributions

Thanks to [@cceyda](https://github.com/cceyda) for adding this dataset.
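
### Example: Reading the Answer Fields

The Data Fields section says each language can have several acceptable answers and that `entity` and `aliases` may be empty. The sketch below is one possible way to consume those fields, not part of the official tooling. The record is copied from the Data Instances example (trimmed to three languages) and the counts come from the answer type table. The helper names `accepted_answers` and `type_shares` are hypothetical, and treating a missing `aliases` key like an empty list is an assumption made here because the `zh_cn` answer in the example has no such key.

```python
# Record copied from the Data Instances example, trimmed to three languages.
record = {
    "example_id": 563260143484355911,
    "query": "who sings i hear you knocking but you can't come in",
    "answers": {
        "en": [{"type": "entity", "entity": "Q545186",
                "text": "Dave Edmunds", "aliases": []}],
        "ru": [{"type": "entity", "entity": "Q545186",
                "text": "Эдмундс, Дэйв",
                "aliases": ["Эдмундс", "Дэйв Эдмундс", "Эдмундс Дэйв", "Dave Edmunds"]}],
        # This answer has no 'aliases' key and its text ends with a space.
        "zh_cn": [{"type": "entity", "text": "戴维·埃德蒙兹 ", "entity": "Q545186"}],
    },
}

# Occurrence counts copied from the answer type table.
ANSWER_TYPE_COUNTS = {
    "entity": 4221,
    "long_answer": 1815,
    "unanswerable": 1427,
    "date": 1174,
    "number": 485,
    "number_with_unit": 394,
    "short_phrase": 346,
    "binary": 138,
}


def accepted_answers(record, lang):
    """Hypothetical helper: every acceptable answer string for `lang`.

    Collects each answer's `text` followed by its `aliases`, strips
    surrounding whitespace, and drops empty values and duplicates.
    """
    strings = []
    for answer in record["answers"].get(lang, []):
        # Assumption: a missing or null `aliases` behaves like an empty list.
        candidates = [answer.get("text")] + list(answer.get("aliases") or [])
        for candidate in candidates:
            if candidate and candidate.strip():
                strings.append(candidate.strip())
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(strings))


def type_shares(counts):
    """Hypothetical helper: fraction of examples per answer type."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


print(accepted_answers(record, "en"))        # ['Dave Edmunds']
print(len(accepted_answers(record, "ru")))   # 5
print(sum(ANSWER_TYPE_COUNTS.values()))      # 10000
print(f"{type_shares(ANSWER_TYPE_COUNTS)['entity']:.1%}")  # 42.2%
```

The eight counts sum to 10,000, which matches the size of the single train split, so each example falls under exactly one answer type in this breakdown.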


Provider:

maas

Created:

2025-07-04
