hpprc/msmarco-ja
Source: Hugging Face. Updated: 2024-11-20. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/hpprc/msmarco-ja
Description:
MSMARCO-Ja is a Japanese translation of the English MSMARCO dataset, produced with Japanese-capable LLMs. Its goal is to improve the performance of downstream models by raising translation quality. The dataset is split into several subsets, each with its own purpose and characteristics: `collection`, `collection-filtered`, `collection-sim`, `dataset`, `dataset-filtered`, `dataset-llm-score`, and `dataset-sim`. The `collection-sim` and `dataset-sim` subsets contain the cosine similarity of each English-Japanese translation pair, computed with the Multilingual E5 large model. The `collection-filtered` and `dataset-filtered` subsets are filtered to keep only the best Japanese translation for each English instance.
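The similarity scoring and best-translation filtering described above can be illustrated with a minimal sketch. It is not the dataset author's pipeline. It assumes that "best" means the candidate with the highest English-Japanese cosine similarity, which the description does not state. The field names `id`, `text_ja`, `en_vec`, and `ja_vec` are hypothetical, and small toy vectors stand in for Multilingual E5 large embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_best_translation(rows: List[dict]) -> Dict[int, dict]:
    """For each English instance id, keep the candidate with the highest similarity.

    Assumption: each row carries hypothetical fields `id`, `text_ja`,
    `en_vec` (embedding of the English text) and `ja_vec` (embedding of
    the Japanese candidate). Ties keep the first candidate seen.
    """
    best: Dict[int, dict] = {}
    for row in rows:
        score = cosine_similarity(row["en_vec"], row["ja_vec"])
        current = best.get(row["id"])
        if current is None or score > current["sim"]:
            best[row["id"]] = {"id": row["id"], "text_ja": row["text_ja"], "sim": score}
    return best


# Toy data: two candidate translations for instance 0, one for instance 1.
rows = [
    {"id": 0, "text_ja": "候補A", "en_vec": [1.0, 0.0], "ja_vec": [0.0, 1.0]},
    {"id": 0, "text_ja": "候補B", "en_vec": [1.0, 0.0], "ja_vec": [1.0, 0.0]},
    {"id": 1, "text_ja": "候補C", "en_vec": [1.0, 1.0], "ja_vec": [1.0, 0.0]},
]
filtered = keep_best_translation(rows)
```

In this toy run, instance 0 keeps `候補B` because its vector is aligned with the English one. With real embeddings, the vectors would come from an encoder rather than being written by hand.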
Provider:
hpprc



