Vietnamese-THUIR-T2Ranking-gg-translated
Source: ModelScope community (魔搭社区). Updated 2025-12-04; indexed 2025-06-14.
Download link:
https://modelscope.cn/datasets/5CD-AI/Vietnamese-THUIR-T2Ranking-gg-translated
# 📚 5CD-AI/Vietnamese-THUIR-T2Ranking-gg-translated
## 📝 Overview
**Vietnamese-THUIR-T2Ranking-gg-translated** is a large-scale Vietnamese dataset for *passage ranking*.
It was machine-translated from the original Chinese [THUIR/T2Ranking](https://huggingface.co/datasets/THUIR/T2Ranking) [1] with Google Translate, following the approach of **mMARCO** [2].
It is intended to support research and applications in Vietnamese *Information Retrieval (IR)*.
In IR, *passage ranking* is an essential and challenging task, typically involving two stages:
1. 🔍 **Passage retrieval** – retrieving candidate passages.
2. 📊 **Passage re-ranking** – re-ordering candidates for final ranking.
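
As a rough illustration of the two stages, the sketch below retrieves candidates with a cheap term-overlap count and then re-orders them by query-term coverage. Both scorers, the function names `retrieve` and `rerank`, and the four sample passages are made up for this example. They stand in for real components (for instance BM25 followed by a neural re-ranker) and are not part of the dataset or of any published baseline.

```python
import unicodedata


def tokenize(text):
    # Normalise to NFC so accented Vietnamese characters compare equal,
    # then split on whitespace (a placeholder for a real tokenizer).
    return unicodedata.normalize("NFC", text).lower().split()


def retrieve(query, passages, k):
    """Stage 1: cheap candidate retrieval by counting matching tokens."""
    q_terms = set(tokenize(query))
    scored = []
    for pid, text in passages.items():
        overlap = sum(1 for tok in tokenize(text) if tok in q_terms)
        if overlap > 0:
            scored.append((overlap, pid))
    # Highest overlap first; ties broken by pid for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [pid for _, pid in scored[:k]]


def rerank(query, candidates, passages):
    """Stage 2: re-order candidates by the share of distinct query terms covered."""
    q_terms = set(tokenize(query))

    def score(pid):
        tokens = tokenize(passages[pid])
        coverage = len(q_terms & set(tokens)) / max(len(q_terms), 1)
        return (coverage, -len(tokens))  # shorter passage wins a tie

    return sorted(candidates, key=score, reverse=True)


# Made-up passages, not taken from the dataset.
passages = {
    "p1": "hà nội là thủ đô của việt nam",
    "p2": "phở là món ăn nổi tiếng của việt nam",
    "p3": "thủ đô thủ đô thủ đô",
    "p4": "bóng đá rất phổ biến",
}
query = "thủ đô của việt nam"

candidates = retrieve(query, passages, k=3)
print(candidates)                           # ['p3', 'p1', 'p2']
print(rerank(query, candidates, passages))  # ['p1', 'p2', 'p3']
```

In this toy run the first stage favours `p3`, which merely repeats two query terms, while the second stage moves the passage that covers the whole query to the top.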
## 📥 Data Download
Data files follow the same structure as the original:
| 📂 Description | 🗂️ Filename | 🔢 Num Records | 📑 Format |
| --------------------------- | --------------------------- | -------------: | -------------------------- |
| Collection | collection_json_*_vi.tsv | 2,303,643 | tsv: pid, text_zh, text_vi |
| Queries Train | data_queries.train_json_vi.tsv | 258,042 | tsv: qid, text_zh, text_vi |
| Queries Dev | data_queries.dev_json_vi.tsv | 24,832 | tsv: qid, text_zh, text_vi |
| Queries Test | data_queries.test_json_vi.tsv | 24,832 | tsv: qid, text_zh, text_vi |
| Qrels Train for re-ranking | qrels.train.tsv | 1,613,421 | TREC qrels format |
| Qrels Dev for re-ranking | qrels.dev.tsv | 400,536 | TREC qrels format |
| Qrels Retrieval Train | qrels.retrieval.train.tsv | 744,663 | tsv: qid, pid |
| Qrels Retrieval Dev | qrels.retrieval.dev.tsv | 118,933 | tsv: qid, pid |
| BM25 Negatives | train.bm25.tsv | 200,359,731 | tsv: qid, pid, index |
| Hard Negatives | train.mined.tsv | 200,376,001 | tsv: qid, pid, index, score |
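
The column layouts in the table can be read with Python's standard `csv` module. The following is a minimal sketch, assuming the files are plain tab-separated text without quoting and that a header row, if one exists, starts with `pid` or `qid`. Neither assumption is confirmed by this card, so check a few lines of the real files first. The helper names `read_text_rows` and `read_retrieval_qrels` and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_text_rows(handle):
    """Yield (id, text_zh, text_vi) tuples from a three-column TSV stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        if i == 0 and row and row[0] in ("pid", "qid"):
            continue  # assumption: skip a header row if the file has one
        if len(row) != 3:
            continue  # skip malformed lines instead of failing
        yield tuple(row)


def read_retrieval_qrels(handle):
    """Map each qid to the set of relevant pids from a two-column TSV stream."""
    positives = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        if i == 0 and row and row[0] == "qid":
            continue  # assumption: skip a header row if the file has one
        if len(row) != 2:
            continue
        qid, pid = row
        positives[qid].add(pid)
    return dict(positives)


# Made-up rows for illustration; real files would be opened with
# open(path, encoding="utf-8", newline="").
sample_queries = io.StringIO("qid\ttext_zh\ttext_vi\n7\t你好\txin chào\n")
sample_qrels = io.StringIO("qid\tpid\n7\t42\n7\t43\n")

print(list(read_text_rows(sample_queries)))
print(read_retrieval_qrels(sample_qrels))
```

Identifiers are kept as strings here so that the code makes no claim about how the ids are encoded.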
🚀 **How to download**
```bash
git lfs install
git clone https://huggingface.co/datasets/5CD-AI/Vietnamese-THUIR-T2Ranking-gg-translated
```
📂 **Folder structure:**
```
├── collection_json_*_vi.tsv
├── data_queries.train_json_vi.tsv
├── data_queries.dev_json_vi.tsv
├── data_queries.test_json_vi.tsv
├── qrels.train.tsv
├── qrels.dev.tsv
├── qrels.retrieval.train.tsv
├── qrels.retrieval.dev.tsv
├── train.bm25.tsv
└── train.mined.tsv
```
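
The `*` in `collection_json_*_vi.tsv` suggests that the collection is split into several shards. A small sketch for iterating over all of them after cloning is shown below. It assumes the shards sit directly in the repository root, as in the listing above. The helper names `collection_shards` and `iter_lines` are hypothetical.

```python
import glob
import os


def collection_shards(root):
    """Return the collection shard paths found directly under root.

    Sorting is lexicographic, so a shard named ..._10_... sorts before
    ..._2_...; that is harmless when order does not matter.
    """
    pattern = os.path.join(root, "collection_json_*_vi.tsv")
    return sorted(glob.glob(pattern))


def iter_lines(paths):
    """Yield lines (without the trailing newline) from each shard in turn."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")


if __name__ == "__main__":
    # Assumed local clone directory; adjust to wherever the repo was cloned.
    shards = collection_shards("Vietnamese-THUIR-T2Ranking-gg-translated")
    print(len(shards), "shard(s) found")
```

Streaming line by line avoids loading the roughly 2.3 million passages into memory at once.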
## 🗒️ Notes
* ⚠️ This dataset was **translated** using Google Translate, so some translations may be imperfect or unnatural.
## 📖 Reference
\[1] X. Xie et al., *T2Ranking: A Large-scale Chinese Benchmark for Passage Ranking*, in **Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23)**, ACM, pp. 2681–2690, 2023. doi: 10.1145/3539618.3591874.
\[2] L. H. Bonifacio et al., *mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset*, **arXiv preprint** arXiv:2108.13897, 2021.
Provider: maas
Created: 2025-01-08



