# Dataset Summary
**mMARCO** is a multilingual version of the [MS MARCO passage ranking dataset](https://microsoft.github.io/msmarco/).
For more information, check out our papers:
* [**mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset**](https://arxiv.org/abs/2108.13897)
* [**A cost-benefit analysis of cross-lingual transfer methods**](https://arxiv.org/abs/2105.06813)
The first (deprecated) version comprised 8 languages: Chinese, French, German, Indonesian, Italian, Portuguese, Russian and Spanish. The current version adds translations for Japanese, Dutch, Vietnamese, Hindi and Arabic, for a total of 14 languages (including the original English version).
### Supported languages
| Language name | Language code |
|---------------|---------------|
| English | english |
| Chinese | chinese |
| French | french |
| German | german |
| Indonesian | indonesian |
| Italian | italian |
| Portuguese | portuguese |
| Russian | russian |
| Spanish | spanish |
| Arabic | arabic |
| Dutch | dutch |
| Hindi | hindi |
| Japanese | japanese |
| Vietnamese | vietnamese |
# Dataset Structure
You can load the mMARCO dataset by choosing a specific language. For each language we include training triples (a query, a positive example, and a negative example) as well as the translated queries and document collection.
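The examples in this section suggest a naming pattern for the configuration passed to `load_dataset`: the bare language code for training triples, and `queries-<code>` or `collection-<code>` for queries and documents. The following is a minimal sketch of a helper that builds these names. `LANGUAGE_CODES` and `mmarco_config` are hypothetical names introduced for illustration, not part of the dataset's API, and the pattern is inferred from the three examples in this card rather than from an official specification.

```python
# Language codes copied from the "Supported languages" table.
LANGUAGE_CODES = {
    "english", "chinese", "french", "german", "indonesian", "italian",
    "portuguese", "russian", "spanish", "arabic", "dutch", "hindi",
    "japanese", "vietnamese",
}


def mmarco_config(language: str, kind: str = "triples") -> str:
    """Build a config name (hypothetical helper; pattern inferred from this card)."""
    if language not in LANGUAGE_CODES:
        raise ValueError(f"unsupported language code: {language!r}")
    if kind == "triples":
        # Training triples use the bare language code, e.g. 'english'.
        return language
    if kind in ("queries", "collection"):
        # Queries and collections use a '<kind>-<language>' name.
        return f"{kind}-{language}"
    raise ValueError(f"unknown kind: {kind!r}")
```

Under these assumptions, `load_dataset('unicamp-dl/mmarco', mmarco_config('spanish', 'queries'))` would request the same configuration as the queries example in this section.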
#### Training triples
```python
>>> from datasets import load_dataset
>>> dataset = load_dataset('unicamp-dl/mmarco', 'english')
>>> dataset['train'][1]
{'query': 'what fruit is native to australia', 'positive': 'Passiflora herbertiana. A rare passion fruit native to Australia. Fruits are green-skinned, white fleshed, with an unknown edible rating. Some sources list the fruit as edible, sweet and tasty, while others list the fruits as being bitter and inedible.assiflora herbertiana. A rare passion fruit native to Australia. Fruits are green-skinned, white fleshed, with an unknown edible rating. Some sources list the fruit as edible, sweet and tasty, while others list the fruits as being bitter and inedible.', 'negative': 'The kola nut is the fruit of the kola tree, a genus (Cola) of trees that are native to the tropical rainforests of Africa.'}
```
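One common way to consume a triple is to split it into two labeled (query, passage) pairs, for example to train a pointwise relevance classifier. The sketch below illustrates that idea on a toy record with the same keys as the example above. It is not necessarily the training procedure used by the authors, and `triple_to_pairs` is a hypothetical helper name.

```python
from typing import Dict, Iterator, Tuple


def triple_to_pairs(example: Dict[str, str]) -> Iterator[Tuple[str, str, int]]:
    """Yield (query, passage, label) pairs; label 1 = positive, 0 = negative."""
    yield example["query"], example["positive"], 1
    yield example["query"], example["negative"], 0


# Toy record with the same keys as a training triple (texts shortened here).
example = {
    "query": "what fruit is native to australia",
    "positive": "Passiflora herbertiana. A rare passion fruit native to Australia.",
    "negative": "The kola nut is the fruit of the kola tree.",
}

pairs = list(triple_to_pairs(example))
for query, passage, label in pairs:
    print(label, query, "|", passage)
```

Each triple yields exactly one positive and one negative pair that share the same query.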
#### Queries
```python
>>> dataset = load_dataset('unicamp-dl/mmarco', 'queries-spanish')
>>> dataset['train'][1]
{'id': 634306, 'text': '¿Qué significa Chattel en el historial de crédito'}
```
#### Collection
```python
>>> dataset = load_dataset('unicamp-dl/mmarco', 'collection-portuguese')
>>> dataset['collection'][100]
{'id': 100, 'text': 'Antonín Dvorák (1841-1904) Antonin Dvorak era filho de açougueiro, mas ele não seguiu o negócio de seu pai. Enquanto ajudava seu pai a meio tempo, estudou música e se formou na Escola de Órgãos de Praga em 1859.'}
```
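Query and collection records share the same `id` and `text` fields, so a text can be looked up by its identifier. The sketch below builds an in-memory index from a small toy list of records. `build_id_index` is a hypothetical helper name, and the second toy record is a placeholder, not real dataset content. The full collection is large, so a plain dictionary may not be the right choice at that scale.

```python
from typing import Dict, Iterable, Mapping


def build_id_index(records: Iterable[Mapping[str, object]]) -> Dict[int, str]:
    """Map each record's integer id to its text."""
    return {int(record["id"]): str(record["text"]) for record in records}


# Toy records with the same fields as the collection example above.
toy_collection = [
    {"id": 100, "text": "Antonín Dvorák (1841-1904) Antonin Dvorak era filho de açougueiro."},
    {"id": 101, "text": "placeholder passage used only for illustration"},
]

index = build_id_index(toy_collection)
print(index[100])
```

The same helper works for query records, since they use the same two fields.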
### Licensing Information
This dataset is released under the [Apache License 2.0](https://www.apache.org/licenses/).
# Citation Information
```
@article{DBLP:journals/corr/abs-2108-13897,
author = {Luiz Bonifacio and
Israel Campiotti and
Roberto de Alencar Lotufo and
Rodrigo Frassetto Nogueira},
title = {mMARCO: {A} Multilingual Version of {MS} {MARCO} Passage Ranking Dataset},
journal = {CoRR},
volume = {abs/2108.13897},
year = {2021},
url = {https://arxiv.org/abs/2108.13897},
eprinttype = {arXiv},
eprint = {2108.13897},
timestamp = {Mon, 20 Mar 2023 15:35:34 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2108-13897.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```