unicamp-dl/mmarco

Name: unicamp-dl/mmarco
Creator: unicamp-dl
Published: 2024-03-06 20:49:39
License: No description provided

Source: Hugging Face. Updated 2024-03-06. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/unicamp-dl/mmarco


Resource description:

# Dataset Summary

**mMARCO** is a multilingual version of the [MS MARCO passage ranking dataset](https://microsoft.github.io/msmarco/).

For more information, check out our papers:

* [**mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset**](https://arxiv.org/abs/2108.13897)
* [**A cost-benefit analysis of cross-lingual transfer methods**](https://arxiv.org/abs/2105.06813)

The first (deprecated) version comprises 8 languages: Chinese, French, German, Indonesian, Italian, Portuguese, Russian and Spanish. The current version adds translations for Japanese, Dutch, Vietnamese, Hindi and Arabic, and is composed of 14 languages (including the original English version).

### Supported languages

| Language name | Language code |
|---------------|---------------|
| English       | english       |
| Chinese       | chinese       |
| French        | french        |
| German        | german        |
| Indonesian    | indonesian    |
| Italian       | italian       |
| Portuguese    | portuguese    |
| Russian       | russian       |
| Spanish       | spanish       |
| Arabic        | arabic        |
| Dutch         | dutch         |
| Hindi         | hindi         |
| Japanese      | japanese      |
| Vietnamese    | vietnamese    |

# Dataset Structure

You can load the mMARCO dataset by choosing a specific language. We include training triples (query, positive and negative example) and the translated collections of documents and queries.

#### Training triples

```python
>>> from datasets import load_dataset
>>> dataset = load_dataset('unicamp-dl/mmarco', 'english')
>>> dataset['train'][1]
{'query': 'what fruit is native to australia', 'positive': 'Passiflora herbertiana. A rare passion fruit native to Australia. Fruits are green-skinned, white fleshed, with an unknown edible rating. Some sources list the fruit as edible, sweet and tasty, while others list the fruits as being bitter and inedible.assiflora herbertiana. A rare passion fruit native to Australia. Fruits are green-skinned, white fleshed, with an unknown edible rating. Some sources list the fruit as edible, sweet and tasty, while others list the fruits as being bitter and inedible.', 'negative': 'The kola nut is the fruit of the kola tree, a genus (Cola) of trees that are native to the tropical rainforests of Africa.'}
```

#### Queries

```python
>>> dataset = load_dataset('unicamp-dl/mmarco', 'queries-spanish')
>>> dataset['train'][1]
{'id': 634306, 'text': '¿Qué significa Chattel en el historial de crédito'}
```

#### Collection

```python
>>> dataset = load_dataset('unicamp-dl/mmarco', 'collection-portuguese')
>>> dataset['collection'][100]
{'id': 100, 'text': 'Antonín Dvorák (1841-1904) Antonin Dvorak era filho de açougueiro, mas ele não seguiu o negócio de seu pai. Enquanto ajudava seu pai a meio tempo, estudou música e se formou na Escola de Órgãos de Praga em 1859.'}
```

### Licensing Information

This dataset is released under [Apache license 2.0](https://www.apache.org/licenses/).

# Citation Information

```
@article{DBLP:journals/corr/abs-2108-13897,
  author     = {Luiz Bonifacio and
                Israel Campiotti and
                Roberto de Alencar Lotufo and
                Rodrigo Frassetto Nogueira},
  title      = {mMARCO: {A} Multilingual Version of {MS} {MARCO} Passage Ranking Dataset},
  journal    = {CoRR},
  volume     = {abs/2108.13897},
  year       = {2021},
  url        = {https://arxiv.org/abs/2108.13897},
  eprinttype = {arXiv},
  eprint     = {2108.13897},
  timestamp  = {Mon, 20 Mar 2023 15:35:34 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2108-13897.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```


Provider:

unicamp-dl

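Summary of the original information

Dataset overview

mMARCO is a multilingual version of the MS MARCO passage ranking dataset. It supports 14 languages: English, Chinese, French, German, Indonesian, Italian, Portuguese, Russian, Spanish, Arabic, Dutch, Hindi, Japanese, and Vietnamese.

Dataset structure

The mMARCO dataset is loaded by selecting a specific language. It contains training triples (query, positive example, negative example), the translated document collections, and the translated queries.

Training triples: examples consisting of a query, a positive passage, and a negative passage.
Queries: example queries in a given language.
Collection: example passages from the translated document collection.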
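
As a minimal sketch of how one training triple might be consumed, the snippet below splits a triple into two labeled query-passage pairs, a common input shape for a pointwise reranker. The field names `query`, `positive`, and `negative` come from the dataset card; the helper `triple_to_pairs` and the 1/0 label convention are assumptions made for illustration, and the passage text is shortened.

```python
from typing import Dict, List, Tuple


def triple_to_pairs(triple: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Turn one mMARCO training triple into (query, passage, label) pairs.

    Hypothetical helper: label 1 marks the positive passage, 0 the negative.
    """
    return [
        (triple["query"], triple["positive"], 1),
        (triple["query"], triple["negative"], 0),
    ]


# Toy triple shaped like the records in the 'train' split (text shortened).
example = {
    "query": "what fruit is native to australia",
    "positive": "Passiflora herbertiana. A rare passion fruit native to Australia.",
    "negative": "The kola nut is the fruit of the kola tree.",
}

pairs = triple_to_pairs(example)
for query, passage, label in pairs:
    print(label, query, "->", passage[:40])
```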

License information

The dataset is released under the Apache License 2.0.


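Collected summary

Dataset introduction

Construction

mMARCO is a multilingual extension of the MS MARCO passage ranking dataset. It was built by translating the original documents and queries, and it provides training triples (query, positive example, negative example). It covers 14 languages, including English, Chinese, French, and German, and keeps the data structure of the original English version, so the same records can be used interchangeably across languages.

Features

The main feature of mMARCO is its multilingual coverage, which gives researchers a testbed for cross-lingual information retrieval. Because the same data is available in many languages, retrieval methods can be transferred between languages and compared. The dataset is released under the Apache 2.0 license, which keeps it open and shareable and supports exchange within the research community.

Usage

When using mMARCO, choose the language version you need. The dataset provides training triples, translated document collections, and translated queries, all loaded through the same API. For example, the English training data is loaded with `load_dataset('unicamp-dl/mmarco', 'english')`. Use of the dataset is subject to its license.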
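
The configuration names on the dataset card appear to follow a simple pattern: the bare language code for training triples, and `queries-<language>` or `collection-<language>` for the other two resources. The sketch below builds such names and rejects languages outside the supported list. The helper `mmarco_config` and the `kind` labels are hypothetical, and the pattern is inferred only from the three examples shown on the card (`english`, `queries-spanish`, `collection-portuguese`).

```python
SUPPORTED_LANGUAGES = {
    "english", "chinese", "french", "german", "indonesian", "italian",
    "portuguese", "russian", "spanish", "arabic", "dutch", "hindi",
    "japanese", "vietnamese",
}


def mmarco_config(language: str, kind: str = "triples") -> str:
    """Build the configuration name passed as the second argument of load_dataset.

    Hypothetical helper; the naming pattern is an assumption inferred from
    the examples 'english', 'queries-spanish', and 'collection-portuguese'.
    """
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if kind == "triples":
        return language
    if kind in ("queries", "collection"):
        return f"{kind}-{language}"
    raise ValueError(f"unknown kind: {kind!r}")


print(mmarco_config("english"))
print(mmarco_config("spanish", "queries"))
print(mmarco_config("portuguese", "collection"))
```

With the `datasets` library installed, the returned string can be passed straight through, for example `load_dataset('unicamp-dl/mmarco', mmarco_config('spanish', 'queries'))`.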

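Background and challenges

Background

mMARCO, the multilingual version of the MS MARCO passage ranking dataset, was developed by Luiz Bonifacio, Israel Campiotti, Roberto de Alencar Lotufo, Rodrigo Frassetto Nogueira, and other researchers, and was released in 2021. It targets key problems in cross-lingual information retrieval and supports 14 languages: English, Chinese, French, German, Indonesian, Italian, Portuguese, Russian, Spanish, Arabic, Dutch, Hindi, Japanese, and Vietnamese. The dataset is a valuable resource for multilingual information retrieval and supports research on related algorithms and models.

Current challenges

Building mMARCO involved several challenges. First, translating content accurately across languages and verifying that the passage rankings remain accurate are the two main difficulties. Second, data quality and consistency must be maintained across languages, which is essential for training multilingual models. Finally, the builders had to process and integrate data in many languages efficiently while preserving both the scale and the diversity of the dataset.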
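
One concrete way to look at cross-language consistency is to check that every translated collection contains the same passage ids as the reference collection. The sketch below does this on toy id lists; the helper `missing_ids` is hypothetical, and the assumption that ids are meant to line up across languages is ours, not something the source states.

```python
from typing import Dict, Iterable, Set


def missing_ids(reference_ids: Iterable[int],
                translated: Dict[str, Iterable[int]]) -> Dict[str, Set[int]]:
    """For each language, return the reference ids absent from its collection."""
    reference = set(reference_ids)
    return {lang: reference - set(ids) for lang, ids in translated.items()}


# Toy id lists standing in for the 'id' column of each translated collection.
english_ids = [0, 1, 2, 3]
translated_ids = {
    "portuguese": [0, 1, 2, 3],
    "spanish": [0, 1, 3],  # id 2 deliberately left out
}

report = missing_ids(english_ids, translated_ids)
print(report)
```

An empty set for a language means no reference passage is missing from that collection.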

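Common scenarios

Typical use cases

In natural language processing, the typical use of mMARCO, as the multilingual version of the MS MARCO passage ranking dataset, is to build and evaluate cross-lingual information retrieval systems. Because it provides queries and documents in many languages, researchers can test a retrieval model in different language settings and then improve the accuracy and efficiency of multilingual search systems.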
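
Evaluating a retrieval system on a passage ranking task usually comes down to a rank-based metric. MRR@10 is the one commonly reported for MS MARCO passage ranking, so the sketch below computes it on made-up data. The function name `mrr_at_k`, the query and passage ids, and the relevance judgments are all illustrative assumptions.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[int, List[int]],
             relevant: Dict[int, Set[int]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for query_id, ranked_passages in rankings.items():
        for rank, passage_id in enumerate(ranked_passages[:k], start=1):
            if passage_id in relevant.get(query_id, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings) if rankings else 0.0


# Toy run: query 1 hits at rank 2, query 2 has no relevant passage retrieved.
rankings = {1: [7, 3, 9], 2: [4, 5, 6]}
relevant = {1: {3}, 2: {8}}
score = mrr_at_k(rankings, relevant)
print(score)  # 0.25
```

Running the same function on rankings produced for each language gives one comparable number per language.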

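Practical applications

In practice, mMARCO is used to develop multilingual search engines that let users search in their own language, with applications ranging from internal search systems at multinational companies to public search engines serving a global audience.

Derived work

mMARCO has led to a range of follow-up work, including improved cross-lingual retrieval models, semantic understanding of multilingual documents, and the construction of cross-lingual knowledge bases. This work has advanced information retrieval and has given other areas of natural language processing new perspectives and data resources.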

The content above was collected and summarized by 遇见数据集.
