hotchpotch/JFWIR
Source: Hugging Face. Updated: 2025-06-20. Indexed: 2025-08-30.
Download link:
https://hf-mirror.com/datasets/hotchpotch/JFWIR
Description:
JFWIR (Japanese FineWeb Information Retrieval) is a large-scale Japanese information retrieval dataset of approximately 64 million document-query pairs, created to address the bias of existing Japanese IR datasets towards Wikipedia. It is built on fineweb-2-edu-japanese, a high-quality educational web crawl corpus, and its queries are generated with the query-crafter-japanese model, covering 7 query types. The dataset also includes hard negatives for effective contrastive learning. JFWIR is publicly available through Hugging Face Datasets, with usage examples provided.
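To illustrate why hard negatives matter for contrastive learning, here is a minimal sketch of an InfoNCE-style loss for one query. It is not taken from the JFWIR repository: the function name `info_nce_loss`, the toy 2-dimensional vectors, and the temperature value are all assumptions chosen for illustration, and a real pipeline would use embeddings produced by a retrieval model.

```python
import math


def info_nce_loss(query_vec, positive_vec, negative_vecs, temperature=0.05):
    """InfoNCE-style loss for one query, one positive and a list of negatives.

    Hypothetical helper for illustration; vectors are plain lists of floats.
    """

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    # The positive is placed at index 0, followed by all negatives.
    logits = [cosine(query_vec, positive_vec) / temperature]
    logits += [cosine(query_vec, neg) / temperature for neg in negative_vecs]

    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))

    # Negative log-probability of picking the positive among all candidates.
    return log_sum_exp - logits[0]


# Toy vectors (assumed values, not real embeddings).
query = [1.0, 0.0]
positive = [0.9, 0.1]
hard_negative = [0.7, 0.7]   # fairly similar to the query
easy_negative = [-1.0, 0.0]  # clearly unrelated to the query

loss_hard = info_nce_loss(query, positive, [hard_negative])
loss_easy = info_nce_loss(query, positive, [easy_negative])
print(f"loss with hard negative: {loss_hard:.6f}")
print(f"loss with easy negative: {loss_easy:.6f}")
```

In this sketch the hard negative yields the larger loss, so it contributes a stronger training signal than the easy one, which is the usual motivation for shipping hard negatives with a retrieval dataset.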
Provider:
hotchpotch



