hotchpotch/miracl-hf-unified
Source: Hugging Face. Updated: 2025-07-04. Indexed: 2025-08-30.
Download link:
https://hf-mirror.com/datasets/hotchpotch/miracl-hf-unified
Description:
The MIRACL Unified Dataset is a unified, standardized version of the MIRACL dataset, formatted for direct use with the Hugging Face ecosystem. It provides multilingual information retrieval data over Wikipedia corpora in 18 languages: Arabic, Bengali, English, Spanish, Persian, Finnish, French, Hindi, Indonesian, Japanese, Korean, Russian, Swahili, Telugu, Thai, Chinese, German, and Yoruba. Each language has two complementary datasets: a corpus dataset, which contains document titles and texts, and a query dataset, which contains queries together with sequences of positive and negative examples.
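The corpus/query split described above can be illustrated with a small, self-contained sketch. Everything here is an assumption for illustration: the field names (`docid`, `title`, `text`, `query`, `positives`, `negatives`), the choice to store positives and negatives as document ids, and the toy records are hypothetical and are not taken from the dataset card. Check the actual schema on the dataset page before relying on it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CorpusDoc:
    """One corpus record: a title and a text (field names are hypothetical)."""
    docid: str
    title: str
    text: str


@dataclass
class QueryRecord:
    """One query record; positives/negatives are assumed to be document ids."""
    query: str
    positives: List[str] = field(default_factory=list)
    negatives: List[str] = field(default_factory=list)


def build_training_pairs(
    record: QueryRecord, corpus: Dict[str, CorpusDoc]
) -> List[Tuple[str, str, int]]:
    """Join a query with its documents as (query, passage, label) triples.

    Label 1 marks a positive example and 0 a negative one. Ids that are
    missing from the corpus are skipped.
    """
    pairs = []
    for label, docids in ((1, record.positives), (0, record.negatives)):
        for docid in docids:
            doc = corpus.get(docid)
            if doc is None:
                continue
            pairs.append((record.query, f"{doc.title}\n{doc.text}", label))
    return pairs


# Toy data invented for this sketch; it does not come from MIRACL.
corpus = {
    "d1": CorpusDoc("d1", "Helsinki", "Helsinki is the capital of Finland."),
    "d2": CorpusDoc("d2", "Oslo", "Oslo is the capital of Norway."),
}
record = QueryRecord(
    query="What is the capital of Finland?",
    positives=["d1"],
    negatives=["d2", "missing-id"],
)
pairs = build_training_pairs(record, corpus)
```

With the toy data, `pairs` holds one positive and one negative triple, and the unknown id is dropped. Triples of this shape are a common input for training or evaluating retrievers and rerankers.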
Provider:
hotchpotch



