illuin/grouse
Source: Hugging Face, updated 2024-12-13, indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/illuin/grouse
Description:
---
license:
- mit
language:
- en
multilinguality:
- monolingual
annotations_creators:
- expert-generated
pretty_name: GroUSE
size_categories:
- n<1K
tags:
- rag
- evaluation
- meta-evaluation
configs:
- config_name: default
data_files:
- split: train
path: "train.jsonl"
- split: test
path: "test.jsonl"
---
# Dataset Card for GroUSE
GroUSE (*Grounded QA Unitary Scoring of Evaluators*) is a dataset designed to assess the performance of Grounded QA evaluators. Its purpose is to evaluate whether an LLM, when used as a grounded QA evaluator, delivers the expected scores across six metrics when presented with both good and imperfect answers.
## Dataset Details
### Dataset Description
Each sample has the following form:
```json
{
  "references": [
    "[Content of the 1st reference]",
    "[Content of the 2nd reference]",
    // ...
  ],
  "input": "[Question]",
  "expected_output": "[Ground truth answer]",
  "actual_output": "[Answer to evaluate, can contain mistakes]",
  "conditions": {
    "answer_relevancy_condition": "<5",
    "completeness_condition": "==5",
    "faithfulness_condition": "==1",
    "usefulness_condition": "==None"
  },
  "metadata": {
    "test_type": "Low answer relevancy 1",
    "goal": "Relevancy is low when answer has irrelevant information."
  }
}
```
- **Curated by:** Sacha Muller
- **Funded by:** Illuin Technology
- **Language:** English
- **License:** MIT
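Each entry in `conditions` pairs a comparison operator with an expected value that the evaluator's score must satisfy. The snippet below is a minimal sketch of how such a string could be checked against a score. The function name `check_condition` and the set of accepted operators are assumptions made for illustration; the GroUSE repository may parse conditions differently.

```python
import operator
import re
from typing import Optional

# Operators this sketch understands (an assumption, not taken from the GroUSE code).
_OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}
_CONDITION = re.compile(r"^\s*(==|!=|<=|>=|<|>)\s*(None|-?\d+)\s*$")


def check_condition(condition: str, score: Optional[int]) -> bool:
    """Return True if `score` satisfies a condition string such as "<5" or "==None"."""
    match = _CONDITION.match(condition)
    if match is None:
        raise ValueError(f"Unsupported condition: {condition!r}")
    op, raw_value = match.groups()
    expected = None if raw_value == "None" else int(raw_value)
    if expected is None or score is None:
        # None cannot be ordered, so only (in)equality is meaningful here.
        if op == "==":
            return score == expected
        if op == "!=":
            return score != expected
        return False
    return _OPERATORS[op](score, expected)


# Hypothetical evaluator scores checked against the conditions of the sample above.
conditions = {
    "answer_relevancy_condition": "<5",
    "completeness_condition": "==5",
    "faithfulness_condition": "==1",
    "usefulness_condition": "==None",
}
scores = {"answer_relevancy": 3, "completeness": 5, "faithfulness": 1, "usefulness": None}
results = {name: check_condition(conditions[f"{name}_condition"], value)
           for name, value in scores.items()}
```

With these hypothetical scores every condition holds, so the evaluator would pass this test on all four metrics listed.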
### Dataset Sources
- **Repository:** [github.com/illuin-tech/grouse](https://github.com/illuin-tech/grouse)
- **Paper:** [arxiv.org/abs/2409.06595](https://arxiv.org/abs/2409.06595)
## Uses
The dataset is intended to be used with the [GroUSE repository](https://github.com/illuin-tech/grouse).
## Dataset Structure
The GroUSE dataset comprises 144 samples organized into 9 sets. Every set shares a common question and mostly similar references, with slight variations in the answers. The tests in each set correspond to a predefined typology of 16 test types designed to assess whether an evaluator appropriately penalizes all failure modes and rewards accurate answers across a diverse range of scenarios. Each test type specifies the expected characteristics for both references and answers, and defines an acceptable range of scores for each metric to be deemed valid. The tests focus primarily on edge cases or the detection of subtle errors.
An additional set is available as a "training" set to assist in engineering the prompt for the judge model being tested.
<img src="all_test_types.png" alt="A detailed table presenting 16 types of tests, their goals, failure modes, and the characteristics of the references and answers, along with expected scores in various criteria. The first seven tests focus on checking if correct answers receive the highest marks in different situations. The remaining tests assess specific failure modes such as low relevancy, low completeness, low usefulness, and low faithfulness of answers." style="width:900px;"/>
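To inspect this structure locally, the JSONL splits can be read with the standard library alone. The sketch below assumes a local copy of `test.jsonl` and uses two hypothetical helpers, `load_samples` and `count_by_test_type`; it relies only on the `metadata.test_type` field shown in the sample format.

```python
import json
from collections import Counter
from pathlib import Path


def load_samples(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def count_by_test_type(samples):
    """Count how many samples carry each `metadata.test_type` label."""
    return Counter(sample["metadata"]["test_type"] for sample in samples)
```

If the test split matches the description above, `load_samples("test.jsonl")` should return 144 samples, that is 9 sets × 16 test types. Whether `count_by_test_type` then reports 9 samples per label depends on the labels being identical across sets, which is an assumption here.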
## Context
### Grounded QA Task
Grounded QA is usually the last step of a RAG pipeline: given a question and a set of documents retrieved from the corpus, an LLM must generate an answer to the question. We expect the LLM to cite the document each piece of information comes from, as depicted below. When the documents contain no precise answer, the LLM should say so in its answer. If related information is available in the documents, the LLM can add it to the answer to show that the corpus is not completely off topic for the question.
<img src="grounded_qa_cases.png" alt="Schema showing an example depending on whether the references contain a precise answer, only related information or no information. For each case there is an example of references and ground truth answer. The question is common to the three cases: What is the relationship between Pluto and Neptune? Case 1: the references contain a precise answer. Reference 1: More than 200 objects in 2:3 resonance are known (meaning they complete exactly 2 revolutions around the Sun when Neptune completes 3), among which are Pluto and its moons. Reference 2: Pluto’s axis of rotation is tilted at 57.5 degrees relative to its orbital plane, which is quite high and unusual in the Solar System. Reference 3: On the left: view of a cardiac cycle, of a systolic-diastolic oscillating flow, characteristic of circulatory arrest. Ground truth answer: The 3:2 orbital resonance relationship between Pluto and Neptune means that for every 3 revolutions of Neptune around the Sun, Pluto completes 2 [reference 1 citation]. Case 2: References only contain related information. Reference 1, which contained the precise information, was removed; the two others are left. Ground truth answer: No document seems to precisely answer your question. However, the documents indicate that: Pluto’s axis of rotation is tilted at 57.5 degrees [reference 2 citation]. Case 3: References contain neither an answer nor related information. References 1 and 2 were removed; only reference 3, which is off topic, is left. Ground truth answer: No document seems to precisely answer your question." style="width:800px;"/>
### Grounded QA Evaluation
We propose six metrics to evaluate the quality of a grounded QA answer:
- **Answer relevancy** assesses the relevance of the information provided in the answer regarding the question, using a Likert scale (1 to 5).
- **Completeness** also uses a Likert scale to evaluate whether all relevant information from the documents is present in the answer.
- **Faithfulness** is a binary score that checks if all facts in the answer are accurate and correctly attributed to the corresponding document.
- In adversarial cases and when additional information is provided, **Usefulness** is a binary score that determines if the provided additional information is indeed useful and relevant to the question.
- **Positive Acceptance** and **Negative Rejection** are binary scores indicating a true positive and a true negative respectively in identifying whether the question is answerable.
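The six metrics above can be held in a small record type that enforces their ranges. This is only an illustrative sketch: the class name `GroundedQAScores`, the field names, the 0/1 encoding of binary scores, and the use of `None` for a metric that does not apply (suggested by the `"==None"` condition in the sample format) are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Likert metrics take integer values from 1 to 5; binary metrics take 0 or 1.
LIKERT_METRICS = ("answer_relevancy", "completeness")
BINARY_METRICS = ("faithfulness", "usefulness",
                  "positive_acceptance", "negative_rejection")


@dataclass(frozen=True)
class GroundedQAScores:
    """Scores for one answer; None marks a metric that does not apply."""
    answer_relevancy: Optional[int] = None
    completeness: Optional[int] = None
    faithfulness: Optional[int] = None
    usefulness: Optional[int] = None
    positive_acceptance: Optional[int] = None
    negative_rejection: Optional[int] = None

    def __post_init__(self):
        for name in LIKERT_METRICS:
            value = getattr(self, name)
            if value is not None and value not in range(1, 6):
                raise ValueError(f"{name} must be in 1..5 or None, got {value!r}")
        for name in BINARY_METRICS:
            value = getattr(self, name)
            if value is not None and value not in (0, 1):
                raise ValueError(f"{name} must be 0, 1 or None, got {value!r}")
```

For example, `GroundedQAScores(answer_relevancy=3, completeness=5, faithfulness=1)` is accepted, while `GroundedQAScores(answer_relevancy=6)` raises a `ValueError`.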
### Performance on the dataset
<table>
<thead>
<tr>
<td colspan="2"></td>
<th colspan="7">Agreement rate of metrics on GroUSE</th>
</tr>
<tr>
<th></th>
<th></th>
<th>Answer relevancy</th>
<th>Completeness</th>
<th>Usefulness</th>
<th>Faithfulness</th>
<th>Positive acceptance</th>
<th style="border-right: 1px solid;">Negative rejection</th>
<th>Total test pass rate</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="11">Each metric evaluated in a separate prompt</td>
<td>GPT-4</td>
<td><strong>91.67</strong></td>
<td><strong>88.89</strong></td>
<td><strong>100.0</strong></td>
<td>92.36</td>
<td><strong>98.61</strong></td>
<td style="border-right: 1px solid;"><strong>98.61</strong></td>
<td><strong>95.02</strong></td>
</tr>
<tr>
<td>GPT-4o</td>
<td>79.17</td>
<td>77.08</td>
<td>97.92</td>
<td>92.36</td>
<td>83.33</td>
<td style="border-right: 1px solid;">83.33</td>
<td>85.53</td>
</tr>
<tr>
<td>GPT-4-turbo</td>
<td>90.28</td>
<td>85.42</td>
<td>97.22</td>
<td><strong>93.75</strong></td>
<td>94.44</td>
<td style="border-right: 1px solid;">94.44</td>
<td>92.59</td>
</tr>
<tr>
<td>GPT-3.5-turbo</td>
<td>88.89</td>
<td>50.00</td>
<td>80.56</td>
<td>68.06</td>
<td>77.78</td>
<td style="border-right: 1px solid;">61.81</td>
<td>71.18</td>
</tr>
<tr>
<td>Gemini 1.0 Pro</td>
<td>78.47</td>
<td>75.69</td>
<td>97.22</td>
<td>78.47</td>
<td>84.72</td>
<td style="border-right: 1px solid;">84.72</td>
<td>83.22</td>
</tr>
<tr>
<td>Mixtral 8x7b Instruct</td>
<td>81.25</td>
<td>61.11</td>
<td>81.25</td>
<td>72.22</td>
<td>76.39</td>
<td style="border-right: 1px solid;">75.69</td>
<td>74.65</td>
</tr>
<tr>
<td>Mixtral 8x22b Instruct</td>
<td>80.56</td>
<td>68.75</td>
<td>81.94</td>
<td>83.33</td>
<td>76.39</td>
<td style="border-right: 1px solid;">72.22</td>
<td>77.20</td>
</tr>
<tr>
<td>Prometheus 2 7b</td>
<td>72.22</td>
<td>41.67</td>
<td>16.67</td>
<td>38.19</td>
<td>73.61</td>
<td style="border-right: 1px solid;">74.31</td>
<td>52.78</td>
</tr>
<tr>
<td>Prometheus 2 8x7b</td>
<td>61.81</td>
<td>25.00</td>
<td>34.03</td>
<td>72.22</td>
<td>67.36</td>
<td style="border-right: 1px solid;">69.44</td>
<td>54.98</td>
</tr>
<tr>
<td>Llama-3 70b Instruct</td>
<td>90.28</td>
<td>63.89</td>
<td>76.39</td>
<td>73.61</td>
<td>85.42</td>
<td style="border-right: 1px solid;">85.42</td>
<td>79.17</td>
</tr>
<tr>
<td>Llama-3 8b Instruct</td>
<td>85.42</td>
<td>49.31</td>
<td>80.56</td>
<td>59.72</td>
<td>72.92</td>
<td style="border-right: 1px solid;">68.06</td>
<td>69.33</td>
</tr>
<tr>
<td rowspan="2">All metrics with one prompt</td>
<td>Llama-3 8b Instruct</td>
<td>31.25</td>
<td>18.06</td>
<td>34.03</td>
<td>56.94</td>
<td>52.78</td>
<td style="border-right: 1px solid;">46.53</td>
<td>39.93</td>
</tr>
<tr>
<td>Finetuned Llama 3 8b</td>
<td>88.89</td>
<td>81.94</td>
<td>81.25</td>
<td>52.78</td>
<td>91.67</td>
<td style="border-right: 1px solid;">91.67</td>
<td>81.37</td>
</tr>
<tr>
<td>Adapted protocol</td>
<td>Human annotators</td>
<td>98.26</td>
<td>92.36</td>
<td>97.92</td>
<td>95.49</td>
<td>96.53</td>
<td style="border-right: 1px solid;">96.88</td>
<td>96.24</td>
</tr>
</tbody>
</table>
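The sketch below shows how an agreement rate can be computed and checks one reading of the last column. In the rows checked, the total test pass rate equals the mean of the six per-metric agreement rates to within rounding; this is an observation drawn from the table, not a definition stated in this card. The helper names are hypothetical, and the count of 132 passed tests is an illustrative figure chosen because 132 out of 144 gives 91.67%.

```python
from statistics import mean

# Values copied from the table above: six per-metric agreement rates,
# followed by the reported total test pass rate.
REPORTED = {
    "GPT-4": ([91.67, 88.89, 100.0, 92.36, 98.61, 98.61], 95.02),
    "GPT-4o": ([79.17, 77.08, 97.92, 92.36, 83.33, 83.33], 85.53),
    "Llama-3 8b Instruct (separate prompts)": (
        [85.42, 49.31, 80.56, 59.72, 72.92, 68.06], 69.33),
    "Human annotators": ([98.26, 92.36, 97.92, 95.49, 96.53, 96.88], 96.24),
}


def agreement_rate(passed: int, total: int) -> float:
    """Percentage of tests where the evaluator's score met the expected condition."""
    return 100.0 * passed / total


def matches_mean(per_metric, reported_total, tolerance=0.01):
    """Check whether the reported total equals the mean of the per-metric rates."""
    return abs(mean(per_metric) - reported_total) <= tolerance


checks = {name: matches_mean(rates, total) for name, (rates, total) in REPORTED.items()}
example_rate = round(agreement_rate(132, 144), 2)
```

Read this way, the total weights the six metrics equally, so a weak score on one metric lowers the total by one sixth of the shortfall.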
## Dataset creation
### Annotation process
The grounding documents primarily consist of excerpts from Wikipedia, supplemented with manually scraped content from various sources such as news articles, popular science pieces, and medical papers. To simulate retrieval system noise, the references were intentionally altered by truncating sentences, mimicking poorly parsed tables, and including irrelevant headers or footers. To further replicate real-world retrieval challenges, the dataset also includes completely off-topic documents as well as incomplete but contextually relevant references. As for the answers, those with perfect expected marks were written from scratch and then slightly modified to match the other test types, sometimes with the help of an AI writing assistant, but always with final human corrections.
### Who are the annotators?
The GroUSE dataset was constructed by a single annotator who speaks fluent English.
### Personal and Sensitive Information
The dataset contains only publicly available information.
## Bias, Risks, and Limitations
- The unit tests are designed to identify edge cases but do not account for intermediate performance levels. This focus on extreme scenarios might overlook nuances in model performance that are critical for a comprehensive evaluation.
- In addition, the tests were built within a single domain, specifically using Wikipedia as the knowledge base. Consequently, our findings may not generalize to out-of-domain scenarios. Future work should include diverse domains to test the robustness and adaptability of our evaluation framework.
## Citation
```
@misc{muller2024grouse,
title={GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering},
author={Sacha Muller and António Loison and Bilel Omrani and Gautier Viaud},
year={2024},
eprint={2409.06595},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.06595},
}
```
## Dataset Card Contact
For any questions about the dataset, please contact [antonio.loison@illuin.tech](mailto:antonio.loison@illuin.tech) or [gautier.viaud@illuin.tech](mailto:gautier.viaud@illuin.tech).
Provider:
illuin



