apple/mkqa

Name: apple/mkqa
Creator: apple
Published: 2024-01-18 11:09:04
License: no description provided

Hugging Face · Updated 2024-01-18 · Indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/apple/mkqa

Resource description:

--- annotations_creators: - crowdsourced language_creators: - found language: - ar - da - de - en - es - fi - fr - he - hu - it - ja - km - ko - ms - nl - 'no' - pl - pt - ru - sv - th - tr - vi - zh license: - cc-by-3.0 multilinguality: - multilingual - translation size_categories: - 10K<n<100K source_datasets: - extended|natural_questions - original task_categories: - question-answering task_ids: - open-domain-qa paperswithcode_id: mkqa pretty_name: Multilingual Knowledge Questions and Answers dataset_info: features: - name: example_id dtype: string - name: queries struct: - name: ar dtype: string - name: da dtype: string - name: de dtype: string - name: en dtype: string - name: es dtype: string - name: fi dtype: string - name: fr dtype: string - name: he dtype: string - name: hu dtype: string - name: it dtype: string - name: ja dtype: string - name: ko dtype: string - name: km dtype: string - name: ms dtype: string - name: nl dtype: string - name: 'no' dtype: string - name: pl dtype: string - name: pt dtype: string - name: ru dtype: string - name: sv dtype: string - name: th dtype: string - name: tr dtype: string - name: vi dtype: string - name: zh_cn dtype: string - name: zh_hk dtype: string - name: zh_tw dtype: string - name: query dtype: string - name: answers struct: - name: ar list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: da list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: de list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary 
- name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: en list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: es list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: fi list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: fr list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: he list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: hu list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: it list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: ja list: - name: type dtype: class_label: names: '0': entity '1': 
long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: ko list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: km list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: ms list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: nl list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: 'no' list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: pl list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: pt list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: 
string - name: aliases list: string - name: ru list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: sv list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: th list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: tr list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: vi list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: zh_cn list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: zh_hk list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date '4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string - name: zh_tw list: - name: type dtype: class_label: names: '0': entity '1': long_answer '2': unanswerable '3': date 
'4': number '5': number_with_unit '6': short_phrase '7': binary - name: entity dtype: string - name: text dtype: string - name: aliases list: string config_name: mkqa splits: - name: train num_bytes: 36005650 num_examples: 10000 download_size: 11903948 dataset_size: 36005650 ---

# Dataset Card for MKQA: Multilingual Knowledge Questions & Answers

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- [**Homepage:**](https://github.com/apple/ml-mkqa/)
- [**Paper:**](https://arxiv.org/abs/2007.15207)

### Dataset Summary

MKQA contains 10,000 queries sampled from the [Google Natural Questions dataset](https://github.com/google-research-datasets/natural-questions). For each query we collect new passage-independent answers. These queries and answers are then human translated into 25 non-English languages.

### Supported Tasks and Leaderboards

`question-answering`

### Languages

| Language code | Language name |
|---------------|---------------|
| `ar` | `Arabic` |
| `da` | `Danish` |
| `de` | `German` |
| `en` | `English` |
| `es` | `Spanish` |
| `fi` | `Finnish` |
| `fr` | `French` |
| `he` | `Hebrew` |
| `hu` | `Hungarian` |
| `it` | `Italian` |
| `ja` | `Japanese` |
| `ko` | `Korean` |
| `km` | `Khmer` |
| `ms` | `Malay` |
| `nl` | `Dutch` |
| `no` | `Norwegian` |
| `pl` | `Polish` |
| `pt` | `Portuguese` |
| `ru` | `Russian` |
| `sv` | `Swedish` |
| `th` | `Thai` |
| `tr` | `Turkish` |
| `vi` | `Vietnamese` |
| `zh_cn` | `Chinese (Simplified)` |
| `zh_hk` | `Chinese (Hong Kong)` |
| `zh_tw` | `Chinese (Traditional)` |

## Dataset Structure

### Data Instances

An example from the data set looks as follows:

```
{
    'example_id': 563260143484355911,
    'queries': {
        'en': "who sings i hear you knocking but you can't come in",
        'ru': "кто поет i hear you knocking but you can't come in",
        'ja': '「 I hear you knocking」は誰が歌っていますか',
        'zh_cn': "《i hear you knocking but you can't come in》是谁演唱的",
        ...
    },
    'query': "who sings i hear you knocking but you can't come in",
    'answers': {
        'en': [{'type': 'entity', 'entity': 'Q545186', 'text': 'Dave Edmunds', 'aliases': []}],
        'ru': [{'type': 'entity', 'entity': 'Q545186', 'text': 'Эдмундс, Дэйв', 'aliases': ['Эдмундс', 'Дэйв Эдмундс', 'Эдмундс Дэйв', 'Dave Edmunds']}],
        'ja': [{'type': 'entity', 'entity': 'Q545186', 'text': 'デイヴ・エドモンズ', 'aliases': ['デーブ・エドモンズ', 'デイブ・エドモンズ']}],
        'zh_cn': [{'type': 'entity', 'text': '戴维·埃德蒙兹 ', 'entity': 'Q545186'}],
        ...
    },
}
```

### Data Fields

Each example in the dataset contains the unique Natural Questions `example_id`, the original English `query`, and then `queries` and `answers` in 26 languages. Each answer is labelled with an answer type.
The breakdown is:

| Answer Type | Occurrence |
|---------------|---------------|
| `entity` | `4221` |
| `long_answer` | `1815` |
| `unanswerable` | `1427` |
| `date` | `1174` |
| `number` | `485` |
| `number_with_unit` | `394` |
| `short_phrase` | `346` |
| `binary` | `138` |

For each language, there can be more than one acceptable textual answer, in order to capture a variety of possible valid answers.

Detailed explanation of fields taken from [here](https://github.com/apple/ml-mkqa/#dataset).

When the `entity` field is not available, it is set to an empty string `''`.
When the `aliases` field is not available, it is set to an empty list `[]`.

### Data Splits

- Train: 10000

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[Google Natural Questions dataset](https://github.com/google-research-datasets/natural-questions)

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[CC BY-SA 3.0](https://github.com/apple/ml-mkqa#license)

### Citation Information

```
@misc{mkqa,
    title = {MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering},
    author = {Shayne Longpre and Yi Lu and Joachim Daiber},
    year = {2020},
    URL = {https://arxiv.org/pdf/2007.15207.pdf}
}
```

### Contributions

Thanks to [@cceyda](https://github.com/cceyda) for adding this dataset.
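
The Data Fields section says that a missing `entity` becomes an empty string and a missing `aliases` becomes an empty list, yet the `zh_cn` answer in the card's example has no `aliases` key at all. The following is a minimal sketch of how raw records could be brought into that uniform shape before use. The helper name `normalize_answer` is hypothetical and is not part of the dataset or of any library.

```python
from typing import Any, Dict


def normalize_answer(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of one answer record with the card's defaults filled in.

    Hypothetical helper: `entity` falls back to '' and `aliases` to [],
    as described in the Data Fields section.
    """
    return {
        "type": raw["type"],
        "entity": raw.get("entity") or "",
        "text": raw.get("text"),
        "aliases": list(raw.get("aliases") or []),
    }


# The zh_cn answer from the card's example, which carries no `aliases` key.
raw_zh_cn = {"type": "entity", "text": "戴维·埃德蒙兹 ", "entity": "Q545186"}
normalized = normalize_answer(raw_zh_cn)
print(normalized["aliases"])  # []
```

Filling the defaults once, at load time, means downstream code can iterate over `aliases` without checking whether the key exists.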

Provider:

apple

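Summary of Original Information

Dataset Overview

Basic Information

Dataset name: Multilingual Knowledge Questions and Answers (MKQA)
Dataset ID: mkqa
Dataset type: multilingual question-answering dataset
Size category: 10K<n<100K
Languages: 26, namely Arabic, Danish, German, English, Spanish, Finnish, French, Hebrew, Hungarian, Italian, Japanese, Korean, Khmer, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Thai, Turkish, Vietnamese, Chinese (Simplified), Chinese (Hong Kong), and Chinese (Traditional)
License: CC BY 3.0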

Dataset Source

Source dataset: extended from the Google Natural Questions dataset

Task Type

Task category: question answering
Task ID: open-domain question answering (open-domain-qa)

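Dataset Structure

Features:
- example_id: string; the example ID
- queries: struct holding the query in each of the 26 languages
- query: string; the original English query
- answers: struct holding the answers in each of the 26 languages; each answer has a type, an entity, a text, and a list of aliases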
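
The card's schema declares the answer `type` as a class label with eight names, numbered `'0'` to `'7'`. Depending on how the data is loaded, the value may therefore arrive as an integer id or as a string. That is an assumption about the loader, not something the page states. The sketch below decodes either form; `ANSWER_TYPES` and `decode_type` are illustrative names.

```python
from typing import Union

# Label names in the order given by the card's schema ('0' .. '7').
ANSWER_TYPES = [
    "entity",
    "long_answer",
    "unanswerable",
    "date",
    "number",
    "number_with_unit",
    "short_phrase",
    "binary",
]


def decode_type(label: Union[int, str]) -> str:
    """Map a class-label id to its name; pass known string labels through."""
    if isinstance(label, str):
        if label not in ANSWER_TYPES:
            raise ValueError(f"unknown answer type: {label!r}")
        return label
    return ANSWER_TYPES[label]


print(decode_type(0))       # entity
print(decode_type(7))       # binary
print(decode_type("date"))  # date
```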

Data Splits

Train: 10,000 examples

Answer Type Distribution

Answer type	Occurrences
`entity`	`4221`
`long_answer`	`1815`
`unanswerable`	`1427`
`date`	`1174`
`number`	`485`
`number_with_unit`	`394`
`short_phrase`	`346`
`binary`	`138`
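
As a quick sanity check on the table above, the counts can be summed and converted to shares. This small sketch uses only the numbers listed on this page. The total comes out to exactly 10,000, matching the size of the train split.

```python
# Answer-type counts copied from the table above.
counts = {
    "entity": 4221,
    "long_answer": 1815,
    "unanswerable": 1427,
    "date": 1174,
    "number": 485,
    "number_with_unit": 394,
    "short_phrase": 346,
    "binary": 138,
}

total = sum(counts.values())
shares = {name: 100 * n / total for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:<18}{n:>6}{shares[name]:>8.2f}%")
print(f"{'total':<18}{total:>6}")
```

Entities account for 42.21% of the examples, while `long_answer` and `unanswerable` together make up 32.42%, so roughly a third of the queries have no short answer.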

Dataset Example

    {
      "example_id": 563260143484355911,
      "queries": {
        "en": "who sings i hear you knocking but you can't come in",
        "ru": "кто поет i hear you knocking but you can't come in",
        "ja": "「 I hear you knocking」は誰が歌っていますか",
        "zh_cn": "《i hear you knocking but you can't come in》是谁演唱的",
        ...
      },
      "query": "who sings i hear you knocking but you can't come in",
      "answers": {
        "en": [ { "type": "entity", "entity": "Q545186", "text": "Dave Edmunds", "aliases": [] } ],
        "ru": [ { "type": "entity", "entity": "Q545186", "text": "Эдмундс, Дэйв", "aliases": ["Эдмундс", "Дэйв Эдмундс", "Эдмундс Дэйв", "Dave Edmunds"] } ],
        "ja": [ { "type": "entity", "entity": "Q545186", "text": "デイヴ・エドモンズ", "aliases": ["デーブ・エドモンズ", "デイブ・エドモンズ"] } ],
        "zh_cn": [ { "type": "entity", "text": "戴维·埃德蒙兹 ", "entity": "Q545186" } ],
        ...
      }
    }

Citation Information

    @misc{mkqa,
      title = {MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering},
      author = {Shayne Longpre and Yi Lu and Joachim Daiber},
      year = {2020},
      URL = {https://arxiv.org/pdf/2007.15207.pdf}
    }

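Collected Summary

Dataset Introduction

Construction

In open-domain question answering research, high-quality multilingual datasets are essential for evaluating how well models generalize across languages. MKQA is built on the Google Natural Questions dataset, from which 10,000 knowledge-seeking queries were sampled. The answers to each query were re-annotated by humans so that they do not depend on any particular retrieved passage. The English queries and answers were then human-translated into 25 other languages, including Arabic, Chinese, and Japanese, giving parallel question-answer data in 26 languages.

Features

The dataset covers 26 languages, including three Chinese variants: Chinese (Simplified), Chinese (Hong Kong), and Chinese (Traditional). Every answer carries one of eight type labels (entity, long answer, unanswerable, date, number, number with unit, short phrase, binary). For each language there may be more than one acceptable answer text, plus a list of aliases, to capture the different ways a correct answer can be written. With 10,000 examples, the dataset is small enough for fast evaluation runs, and its structured answers support fine-grained analysis of multilingual QA performance by language and by answer type.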
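
To make the "multiple acceptable texts plus aliases" idea concrete, here is a small sketch that flattens one language's answer list into the set of strings a system output could be compared against. The helper name `accepted_strings` is hypothetical. The sample records are copied from the example on this page.

```python
from typing import Any, Dict, List


def accepted_strings(answers_for_lang: List[Dict[str, Any]]) -> List[str]:
    """Collect each answer's text and aliases, stripped and de-duplicated,
    preserving first-seen order."""
    seen: List[str] = []
    for answer in answers_for_lang:
        candidates = [answer.get("text"), *(answer.get("aliases") or [])]
        for candidate in candidates:
            if candidate is None:
                continue
            candidate = candidate.strip()
            if candidate and candidate not in seen:
                seen.append(candidate)
    return seen


# Russian and Simplified Chinese answers from the example on this page.
ru = [{"type": "entity", "entity": "Q545186", "text": "Эдмундс, Дэйв",
       "aliases": ["Эдмундс", "Дэйв Эдмундс", "Эдмундс Дэйв", "Dave Edmunds"]}]
zh_cn = [{"type": "entity", "text": "戴维·埃德蒙兹 ", "entity": "Q545186"}]

print(accepted_strings(ru))
print(accepted_strings(zh_cn))  # trailing space in the source text is stripped
```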

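Usage

Researchers can use the dataset to benchmark multilingual open-domain QA models. Once it is loaded through the Hugging Face datasets library, the query text and structured answers for each language are directly accessible. Typical uses include measuring answer accuracy per language, analyzing how a model handles different answer types such as entities and dates, and studying cross-lingual knowledge transfer. The uniform record format makes it straightforward to plug into an existing pipeline and to filter subsets by language or by answer type.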
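
The per-language evaluation described above can be sketched as an exact-match check against each answer's text and aliases. This is a simplified stand-in, not the official MKQA evaluation: the normalization here (NFKC, case folding, punctuation removal) is an assumption, and the scoring script in the dataset's repository may differ. To stay self-contained, the sketch works on an in-memory record copied from this page rather than loading the dataset. The predictions are made up for illustration.

```python
import unicodedata
from typing import Any, Dict, List


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, case folding, drop punctuation, squeeze spaces."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[Dict[str, Any]]) -> bool:
    """True if the prediction equals any gold text or alias after normalization."""
    golds = set()
    for answer in gold_answers:
        for s in [answer.get("text"), *(answer.get("aliases") or [])]:
            if s:
                golds.add(normalize(s))
    return normalize(prediction) in golds


def em_by_language(example: Dict[str, Any], predictions: Dict[str, str]) -> Dict[str, bool]:
    """Score one example for every language that has a prediction."""
    return {lang: exact_match(pred, example["answers"][lang])
            for lang, pred in predictions.items()}


example = {
    "answers": {
        "en": [{"type": "entity", "entity": "Q545186", "text": "Dave Edmunds", "aliases": []}],
        "ru": [{"type": "entity", "entity": "Q545186", "text": "Эдмундс, Дэйв",
                "aliases": ["Эдмундс", "Дэйв Эдмундс", "Эдмундс Дэйв", "Dave Edmunds"]}],
        "ja": [{"type": "entity", "entity": "Q545186", "text": "デイヴ・エドモンズ",
                "aliases": ["デーブ・エドモンズ", "デイブ・エドモンズ"]}],
    }
}
# Hypothetical model outputs, for illustration only.
predictions = {"en": "dave edmunds", "ru": "Дэйв Эдмундс", "ja": "エドモンズ"}

scores = em_by_language(example, predictions)
print(scores)  # {'en': True, 'ru': True, 'ja': False}
```

Averaging these booleans over all examples, grouped by language or by answer type, gives the per-language and per-type accuracy figures mentioned above.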

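Background and Challenges

Background

Progress in open-domain question answering has long been held back by a shortage of high-quality multilingual data. To address this, a research team at Apple released MKQA (Multilingual Knowledge Questions and Answers) in 2020. The dataset is built on Google Natural Questions: 10,000 English queries and their answers were human-translated, giving parallel data in 26 languages, including Arabic, Chinese, and Japanese. It is intended as a standardized evaluation benchmark for cross-lingual question answering, and its answer annotations distinguish eight types, among them entities, dates, and numbers.

Current Challenges

Multilingual open-domain QA faces two kinds of difficulty: structural differences between languages and differences in cultural context. Meaning lost in translation can detach an answer from its original context, and the many ways different languages refer to the same entity make cross-lingual alignment harder. On the construction side, human translation into 25 languages requires consistent handling of specialized terms despite uneven language resources, and the answers for 10,000 examples across 26 languages must remain semantically equivalent in type annotation and entity linking, which places heavy demands on quality control. Balancing coverage of low-resource languages against depth of annotation for high-resource languages remains an open problem.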

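Common Scenarios

Typical Use Cases

In cross-lingual information retrieval and question answering research, MKQA serves as a benchmark for multilingual open-domain QA models, thanks to its parallel question-answer pairs in 26 languages. Because the queries are sampled from Natural Questions and human-translated, the same question can be posed in every language, which lets researchers test how well a model generalizes across languages and transfers knowledge between them. Its typical use is the fine-tuning and evaluation of multilingual pretrained models.

Derived and Related Work

Research that this listing associates with MKQA includes the multilingual dense retriever mDPR, generative cross-lingual QA built on mT5, and work extending multilingual encoders such as XLM-R. According to the listing, these studies use MKQA's parallel multilingual structure to explore techniques such as knowledge distillation and prompt learning in cross-lingual tasks, and relate to models such as Language-agnostic BERT and InfoXLM.

Recent Research on the Dataset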