
minhnguyent546/mmarco-vietnamese-split

Source: Hugging Face. Updated: 2026-03-20. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/minhnguyent546/mmarco-vietnamese-split
Resource summary:
---
dataset_info:
- config_name: collection
  features:
  - name: id
    dtype: int32
  - name: text
    dtype: string
  splits:
  - name: collection
    num_bytes: 4132323820
    num_examples: 8841823
  download_size: 1911256519
  dataset_size: 4132323820
- config_name: queries
  features:
  - name: id
    dtype: int32
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 45632748
    num_examples: 808731
  - name: dev.full
    num_bytes: 5750589
    num_examples: 101093
  - name: dev
    num_bytes: 368005
    num_examples: 6980
  download_size: 28697870
  dataset_size: 51751342
- config_name: triples
  features:
  - name: query
    dtype: string
  - name: positive
    dtype: string
  - name: negative
    dtype: string
  splits:
  - name: train
    num_bytes: 39512849862
    num_examples: 39780811
  download_size: 13580125249
  dataset_size: 39512849862
configs:
- config_name: collection
  data_files:
  - split: collection
    path: collection/collection-*
- config_name: queries
  data_files:
  - split: train
    path: queries/train-*
  - split: dev.full
    path: queries/dev.full-*
  - split: dev
    path: queries/dev-*
- config_name: triples
  data_files:
  - split: train
    path: triples/train-*
  default: true
license: apache-2.0
task_categories:
- text-ranking
language:
- vi
size_categories:
- 10M<n<100M
---

# Dataset Summary

This dataset contains the Vietnamese split of the [mMARCO dataset](https://huggingface.co/datasets/unicamp-dl/mmarco).

| Subset | Split | # Rows |
| :---: | :---: | ---: |
| triples | train | 39,780,811 |
| queries | train | 808,731 |
| queries | dev.full | 101,093 |
| queries | dev | 6,980 |
| collection | collection | 8,841,823 |

*Note:* `triples` contains (query, positive, negative) triples and can be used for training embedding models.

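As a rough illustration of how the `triples` subset might be read, the sketch below streams a few rows with the Hugging Face `datasets` library. The helper names `take_triples` and `stream_triples` are hypothetical and not part of the dataset card. The repository ID, config name, split name, and column names are taken from the metadata above. Streaming is an assumption chosen here to avoid downloading the roughly 13.6 GB of triples files up front.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

# Repository ID as listed on this page.
REPO_ID = "minhnguyent546/mmarco-vietnamese-split"

Triple = Tuple[str, str, str]


def take_triples(rows: Iterable[Dict[str, str]], limit: int) -> Iterator[Triple]:
    """Yield up to `limit` (query, positive, negative) tuples from `rows`.

    Hypothetical helper: `rows` is any iterable of dicts that carry the three
    string columns declared for the `triples` config.
    """
    for row in islice(rows, limit):
        yield row["query"], row["positive"], row["negative"]


def stream_triples(limit: int = 3) -> List[Triple]:
    """Fetch the first `limit` training triples (needs network access)."""
    # Imported lazily so take_triples stays usable without `datasets` installed.
    from datasets import load_dataset

    # streaming=True iterates over the remote files instead of downloading
    # the whole subset before the first row is available.
    ds = load_dataset(REPO_ID, "triples", split="train", streaming=True)
    return list(take_triples(ds, limit))
```

Calling `stream_triples(3)` should return three tuples of Vietnamese text, provided the `datasets` package is installed and the Hub is reachable. The `queries` and `collection` configs can presumably be loaded the same way by swapping the config and split names listed in the table.
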
## Citing

```bibtex
@article{DBLP:journals/corr/abs-2108-13897,
  author     = {Luiz Bonifacio and
                Israel Campiotti and
                Roberto de Alencar Lotufo and
                Rodrigo Frassetto Nogueira},
  title      = {mMARCO: {A} Multilingual Version of {MS} {MARCO} Passage Ranking Dataset},
  journal    = {CoRR},
  volume     = {abs/2108.13897},
  year       = {2021},
  url        = {https://arxiv.org/abs/2108.13897},
  eprinttype = {arXiv},
  eprint     = {2108.13897},
  timestamp  = {Mon, 20 Mar 2023 15:35:34 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2108-13897.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
Provider:
minhnguyent546