minhnguyent546/mmarco-vietnamese-split
Source: Hugging Face. Updated 2026-03-20; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/minhnguyent546/mmarco-vietnamese-split
Description:
---
dataset_info:
- config_name: collection
  features:
  - name: id
    dtype: int32
  - name: text
    dtype: string
  splits:
  - name: collection
    num_bytes: 4132323820
    num_examples: 8841823
  download_size: 1911256519
  dataset_size: 4132323820
- config_name: queries
  features:
  - name: id
    dtype: int32
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 45632748
    num_examples: 808731
  - name: dev.full
    num_bytes: 5750589
    num_examples: 101093
  - name: dev
    num_bytes: 368005
    num_examples: 6980
  download_size: 28697870
  dataset_size: 51751342
- config_name: triples
  features:
  - name: query
    dtype: string
  - name: positive
    dtype: string
  - name: negative
    dtype: string
  splits:
  - name: train
    num_bytes: 39512849862
    num_examples: 39780811
  download_size: 13580125249
  dataset_size: 39512849862
configs:
- config_name: collection
  data_files:
  - split: collection
    path: collection/collection-*
- config_name: queries
  data_files:
  - split: train
    path: queries/train-*
  - split: dev.full
    path: queries/dev.full-*
  - split: dev
    path: queries/dev-*
- config_name: triples
  data_files:
  - split: train
    path: triples/train-*
  default: true
license: apache-2.0
task_categories:
- text-ranking
language:
- vi
size_categories:
- 10M<n<100M
---
# Dataset Summary
This dataset contains the Vietnamese split of the [mMARCO dataset](https://huggingface.co/datasets/unicamp-dl/mmarco).
| Subset | Split | # Rows |
| :---: | :---: | ---: |
| triples | train | 39,780,811 |
| queries | train | 808,731 |
| queries | dev.full | 101,093 |
| queries | dev | 6,980 |
| collection | collection | 8,841,823 |
*Note:* `triples` contains (query, positive, negative) triples and can be used for training embedding models.
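
The subsets above map to configs and splits in the Hugging Face `datasets` library. Below is a minimal sketch of how one might load them, assuming `datasets` is installed and that streaming is preferable because the `triples` download is roughly 13.6 GB. The helper names (`is_documented_split`, `stream_split`, `unpack_triple`) are hypothetical and not part of the dataset; the repository id, config names, split names, and column names come from the metadata above.

```python
from typing import Dict, List, Tuple

# Repository id from the dataset page; configs and splits from the table above.
REPO_ID = "minhnguyent546/mmarco-vietnamese-split"
SPLITS: Dict[str, List[str]] = {
    "triples": ["train"],
    "queries": ["train", "dev.full", "dev"],
    "collection": ["collection"],
}


def is_documented_split(config: str, split: str) -> bool:
    """Return True if (config, split) is one of the pairs listed in the card."""
    return split in SPLITS.get(config, [])


def stream_split(config: str, split: str):
    """Return a lazily streamed split instead of downloading it in full."""
    if not is_documented_split(config, split):
        raise ValueError(f"unknown config/split: {config!r}/{split!r}")
    # Imported here so the helpers above work without the library installed.
    from datasets import load_dataset  # assumption: `pip install datasets`
    return load_dataset(REPO_ID, config, split=split, streaming=True)


def unpack_triple(row: Dict[str, str]) -> Tuple[str, str, str]:
    """Turn one `triples` row into a (query, positive, negative) tuple."""
    return row["query"], row["positive"], row["negative"]
```

Iterating over `stream_split("triples", "train")` and passing each row to `unpack_triple` should yield one training triple at a time; this needs network access. Since `triples` is marked as the default config, omitting the config name in `load_dataset` would be expected to select it as well.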
## Citing
```bibtex
@article{DBLP:journals/corr/abs-2108-13897,
author = {Luiz Bonifacio and
Israel Campiotti and
Roberto de Alencar Lotufo and
Rodrigo Frassetto Nogueira},
title = {mMARCO: {A} Multilingual Version of {MS} {MARCO} Passage Ranking Dataset},
journal = {CoRR},
volume = {abs/2108.13897},
year = {2021},
url = {https://arxiv.org/abs/2108.13897},
eprinttype = {arXiv},
eprint = {2108.13897},
timestamp = {Mon, 20 Mar 2023 15:35:34 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2108-13897.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
Provider:
minhnguyent546



