mahiyama/JaGovFaqs-22k
Source: Hugging Face. Updated 2026-04-23; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/mahiyama/JaGovFaqs-22k
Summary:
JaGovFaqs-22k is a dataset for fine-tuning Japanese retrieval models, built from Japanese government FAQs and containing 22,794 FAQ entries. It is divided into three subsets: pairs (query-answer pairs), triplets (triplets with hard negatives), and triplets-ruri-scored (hard-negative triplets with teacher scores). The subsets target different training scenarios, such as multiple negatives ranking loss, sparse multiple negatives ranking loss, and KL divergence distillation. The dataset is in Japanese and licensed under CC-BY-4.0. It is suitable for fine-tuning Japanese sparse, dense, and cross-encoder retrievers, for providing teacher labels for KL distillation, and for training QA/dialogue models.
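The three subsets described above could be loaded roughly as follows. This is a minimal sketch, not a confirmed recipe: it assumes the subset names are exposed as Hugging Face configuration names and that the `datasets` package is installed. The helper `load_subset` and the constants are hypothetical names introduced for illustration, and the column names of each subset are not stated in this listing, so inspect them before training.

```python
# Dataset ID taken from the listing URL. Treating the subset names as
# Hugging Face config names is an assumption, not confirmed by the listing.
DATASET_ID = "mahiyama/JaGovFaqs-22k"
SUBSETS = ("pairs", "triplets", "triplets-ruri-scored")


def load_subset(subset: str):
    """Load one subset of the dataset from the Hugging Face Hub (hypothetical helper)."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}; expected one of {SUBSETS}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Returns a DatasetDict keyed by split name when no split is given.
    return load_dataset(DATASET_ID, name=subset)


# Usage (needs network access and the `datasets` package):
#   data = load_subset("pairs")
#   print(data)  # shows the available splits and column names
```

Validating the subset name before calling the Hub keeps typos from turning into slow or confusing download errors.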
Provider:
mahiyama



