mahiyama/mmarco-ja

Name: mahiyama/mmarco-ja
Creator: mahiyama
Published: 2026-04-23 07:27:00
License: No description available

Source: Hugging Face. Updated: 2026-04-23. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/mahiyama/mmarco-ja


Description:

This dataset is derived from hpprc/mmarco-ja, the Japanese MS MARCO passage ranking dataset (391,060 queries and 8,841,823 passages), and repackages it in Triplets/Pairs formats suited to training sparse retrieval models such as SPLADE. It contains Japanese translations of the MS MARCO queries and passages, together with hard negatives for contrastive learning. It is divided into two subsets, triplets and pairs, intended for contrastive learning and pairwise learning respectively. The construction pipeline consists of materializing the collection, detecting empty passages, building triplets, and splitting into train/eval sets. Design decisions include not using a dense encoder for mining and fixing the positive to pos_ids[0]. Known issues include machine translation quality, residual false negatives, and empty passages mixed into the data. The dataset follows the MS MARCO terms of use and is restricted to non-commercial research purposes.
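The pipeline steps named in the description (empty-passage detection, triplet construction with the positive fixed to pos_ids[0], and a train/eval split) can be illustrated with a minimal sketch. Everything here other than `pos_ids` is an assumption made for illustration: the record fields `query` and `neg_ids`, the helper names, the toy data, and the choice to split by query are hypothetical and are not confirmed to match the dataset's actual schema or build scripts.

```python
import random

def is_empty(text):
    """Treat a missing or whitespace-only passage as empty."""
    return text is None or not text.strip()

def build_triplets(records, collection):
    """Return (query, positive, negative) text triplets.

    The positive is always pos_ids[0]. Records whose positive passage is
    empty are skipped, and empty negatives are dropped.
    """
    triplets = []
    for rec in records:
        if not rec["pos_ids"]:
            continue
        pos_text = collection.get(rec["pos_ids"][0])
        if is_empty(pos_text):
            continue
        for neg_id in rec["neg_ids"]:
            neg_text = collection.get(neg_id)
            if is_empty(neg_text):
                continue
            triplets.append((rec["query"], pos_text, neg_text))
    return triplets

def split_train_eval(records, eval_fraction=0.1, seed=0):
    """Shuffle records and split them by query (an assumed strategy)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]

# Hypothetical toy data: passage id -> passage text.
collection = {
    0: "東京は日本の首都です。",
    1: "",                          # empty passage
    2: "大阪は西日本の都市です。",
    3: "   ",                       # whitespace-only passage
    4: "東京都は関東地方にあります。",
}
records = [
    {"query": "日本の首都は?", "pos_ids": [0, 4], "neg_ids": [1, 2]},
    {"query": "空の例", "pos_ids": [3], "neg_ids": [2]},
]

triplets = build_triplets(records, collection)
train_records, eval_records = split_train_eval(records, eval_fraction=0.5)
```

In this toy run the second record is dropped because its only positive is whitespace-only, and passage 1 is dropped as a negative because it is empty. Filtering of this kind would not address the residual false negatives listed among the known issues.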

Provider:

mahiyama
