langtech-innovation/trilingual_query_relevance

Name: langtech-innovation/trilingual_query_relevance
Creator: langtech-innovation
Published: 2025-03-22 08:40:08
License: No description available

Source: Hugging Face. Updated 2025-03-22. Indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/langtech-innovation/trilingual_query_relevance

Description:

This is a query-context relevance dataset for document retrieval in English (EN), Catalan (CA), and Spanish (SP). It was built from the projecte_aina/RAG_Multilingual and PaDaS-Lab/webfaq-retrieval datasets and is intended for fine-tuning embedding models used in Retrieval Augmented Generation (RAG) applications. For records drawn from RAG_Multilingual, the context is limited to the sentences preceding and following the location of the original extractive factoid in the source. Approximately 50% of the records are correct query-context pairings labeled as relevant (relevance score 1.0); in the other half, queries and contexts are shuffled at random within the same language and assigned a relevance score of 0.0. The dataset is split into 80% training and 20% validation sets.
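
The labeling scheme described above (half correct pairs scored 1.0, half same-language mismatches scored 0.0, then an 80/20 split) can be illustrated with a minimal sketch. This is not the creators' actual pipeline: the field names `query`, `context`, `lang`, and `score`, the function names, and the choice of drawing each negative from a different record of the same language are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_relevance_pairs(records, seed=0):
    """Build positive and same-language negative pairs.

    `records` is a list of dicts with the hypothetical keys
    'query', 'context', and 'lang' (assumed names, not the dataset's schema).
    """
    rng = random.Random(seed)

    # Group records by language so negatives never cross languages.
    by_lang = defaultdict(list)
    for rec in records:
        by_lang[rec["lang"]].append(rec)

    pairs = []
    for lang, recs in by_lang.items():
        for i, rec in enumerate(recs):
            # Positive example: the query with its own context.
            pairs.append({"query": rec["query"], "context": rec["context"],
                          "lang": lang, "score": 1.0})
            # Negative example: the same query with the context of a
            # different record in the same language (assumed strategy).
            others = [j for j in range(len(recs)) if j != i]
            if others:
                j = rng.choice(others)
                pairs.append({"query": rec["query"], "context": recs[j]["context"],
                              "lang": lang, "score": 0.0})

    rng.shuffle(pairs)
    return pairs


def train_validation_split(pairs, train_fraction=0.8):
    """Split already-shuffled pairs into training and validation lists."""
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]


if __name__ == "__main__":
    # Toy records; the language codes follow the listing (EN, CA, SP).
    toy = [{"query": f"q-{lang}-{i}", "context": f"ctx-{lang}-{i}", "lang": lang}
           for lang in ("EN", "CA", "SP") for i in range(4)]
    pairs = build_relevance_pairs(toy)
    train, validation = train_validation_split(pairs)
    print(len(pairs), len(train), len(validation))
```

As long as every language has at least two records, each query yields exactly one positive and one negative, which reproduces the roughly 50/50 label balance the description mentions.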

Provided by:

langtech-innovation
