RAG Retrieval Quality Evaluation Dataset

Name: RAG Retrieval Quality Evaluation Dataset
Creator: figshare
Published: 2025-09-06 06:09:11
License: No description available

DataCite Commons. Updated: 2025-09-06. Indexed: 2025-09-08.

Download link:

https://figshare.com/articles/dataset/RAG__/30067627/1

Resource description:

This dataset is designed to assess how retrieval quality affects the generation performance of Retrieval-Augmented Generation (RAG) systems. It contains 728 samples. During data collection, we selected 728 real user queries of varying complexity from public question answering datasets (SQuAD, Natural Questions, and WebQuestions), covering factual question answering, multi-hop reasoning, opinion-seeking queries, and other types. For each query, a hybrid retrieval strategy combining BM25 and Sentence-BERT retrieved relevant documents from the Wikipedia corpus, with 5 to 15 documents retrieved per query.

In the data processing phase, 8 key features were extracted through a combination of manual annotation and automatic computation:

1) Query Complexity: Three annotators independently rated each query on a 1-10 scale, and the average score was taken as the final value.
2) Document Relevance: Obtained by fusing the annotation-based relevance score (ranging from 0 to 1) with the prediction of a pre-trained model.
3) Semantic Similarity Between Retrieved Documents: Calculated with Sentence-BERT as the pairwise semantic similarity among the retrieved documents.
4) Retrieval Diversity: Generated by combining the semantic similarity scores with a TF-IDF-based diversity score of the retrieved document set.
5) Key Entity Coverage: Computed automatically with an entity linking tool.
6) Information Redundancy: Quantified from the degree of content overlap between the retrieved documents.
7) Retrieval Depth: Recorded as the actual number of documents retrieved for each query.
8) Final Response Quality: Scored by five experts on a 0-100 scale across three dimensions (relevance, accuracy, and completeness); this score serves as the ground-truth quality label.
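As a rough illustration of how a few of the features described above might be computed, the following sketch uses only the Python standard library. It is not the dataset authors' code: the function names, the equal fusion weight, and the use of Jaccard token overlap for redundancy are all assumptions, since the description does not give the exact formulas. The embeddings are assumed to be plain lists of floats, for example produced beforehand by a Sentence-BERT model.

```python
import math
from itertools import combinations
from statistics import mean


def query_complexity(ratings):
    """Average the independent annotator ratings (1-10 scale)."""
    return mean(ratings)


def fuse_relevance(annotated, predicted, weight=0.5):
    """Weighted average of an annotated score and a model score, both in [0, 1].

    The weight is an assumption; the source does not state the fusion rule.
    """
    return weight * annotated + (1.0 - weight) * predicted


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(embeddings):
    """Mean cosine similarity over all unordered pairs of document embeddings."""
    pairs = list(combinations(embeddings, 2))
    return mean(cosine(u, v) for u, v in pairs) if pairs else 0.0


def jaccard(doc_a, doc_b):
    """Token-set overlap between two documents (assumed overlap measure)."""
    tokens_a = set(doc_a.lower().split())
    tokens_b = set(doc_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def information_redundancy(docs):
    """Mean pairwise token overlap across the retrieved documents."""
    pairs = list(combinations(docs, 2))
    return mean(jaccard(a, b) for a, b in pairs) if pairs else 0.0


def retrieval_depth(docs):
    """Number of documents actually retrieved for the query."""
    return len(docs)
```

A call such as `information_redundancy(retrieved_texts)` yields a value in [0, 1], where higher values mean the retrieved documents repeat more of the same tokens. Other overlap measures (n-gram overlap, embedding similarity) would fit the description equally well.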
All feature values were standardized so that their ranges are reasonable and consistent with real-world application scenarios, providing a reliable basis for optimizing and evaluating RAG retrieval strategies.
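The description does not say which standardization was applied. The sketch below shows two common candidates, z-score standardization and min-max scaling, for a single feature column. Both function names are hypothetical, and neither is confirmed as the method used for this dataset.

```python
from statistics import mean, pstdev


def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max(values, lo=0.0, hi=1.0):
    """Linearly rescale values into the interval [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant column is mapped to the lower bound.
        return [lo for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]
```

Z-scores suit models that expect centered inputs, while min-max scaling keeps features such as the 1-10 complexity rating and the 0-100 quality score on a shared bounded range.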

Provider:

figshare

Created:

2025-09-06

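Collected Summary

Dataset Introduction

Background and Challenges

Background Overview

The dataset contains 728 samples for evaluating how the retrieval quality of a RAG system affects its generation performance. It covers multiple query types, and 8 key features were extracted through manual annotation and automatic computation.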

The content above was collected and summarized by 遇见数据集.