mteb/miracl-hard-negatives
Source: Hugging Face | Updated: 2025-05-04 | Indexed: 2025-05-31
Download link:
https://hf-mirror.com/datasets/mteb/miracl-hard-negatives
Description:
MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset focused on search across 18 different languages. The hard-negatives version of the dataset was created by pooling the top 250 documents per query from BM25, e5-multilingual-large, and e5-mistral-instruct. The languages in MIRACL include both typologically close and typologically distant ones, span 10 language families and 13 sub-families, and differ in the amount of publicly available resources. Extensive automatic heuristic verification and manual assessment were carried out during annotation to control data quality. MIRACL represents an investment of approximately five person-years of human annotator effort. Its goal is to advance research on retrieval across a continuum of languages and thereby improve information access for diverse populations around the world, particularly those that have traditionally been underserved.
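The pooling step mentioned in the description can be illustrated with a minimal sketch. It assumes that "pooling" means taking, for each query, the union of the top-k document IDs returned by each retrieval system; the function name `pool_candidates`, the `runs` layout, and the toy document IDs are hypothetical and are not taken from the dataset's actual build script.

```python
from typing import Dict, List, Set

# Hypothetical layout: {system name: {query id: [doc ids, best first]}}
Runs = Dict[str, Dict[str, List[str]]]


def pool_candidates(runs: Runs, k: int = 250) -> Dict[str, Set[str]]:
    """Return, per query, the union of each system's top-k document IDs.

    Assumption: pooling is a plain set union of the truncated rankings.
    """
    pooled: Dict[str, Set[str]] = {}
    for system_run in runs.values():
        for query_id, ranked_ids in system_run.items():
            # Keep only the k highest-ranked documents from this system.
            pooled.setdefault(query_id, set()).update(ranked_ids[:k])
    return pooled


# Toy rankings (made up for illustration), pooled with k=2 instead of 250.
toy_runs: Runs = {
    "bm25": {"q1": ["d1", "d2", "d3"]},
    "e5-multilingual-large": {"q1": ["d2", "d4", "d1"]},
    "e5-mistral-instruct": {"q1": ["d5", "d1", "d6"]},
}

toy_pool = pool_candidates(toy_runs, k=2)
print(sorted(toy_pool["q1"]))  # ['d1', 'd2', 'd4', 'd5']
```

Documents that rank highly for a query in at least one system but are not judged relevant are the "hard negatives"; restricting the corpus to the pooled set keeps them while discarding easy, clearly unrelated documents.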
Provider:
mteb



