reveal
Source: ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-04-26.
Download link: https://modelscope.cn/datasets/google/reveal
# Reveal: A Benchmark for Verifiers of Reasoning Chains
## [Paper: A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains](https://arxiv.org/abs/2402.00559)
Link: https://arxiv.org/abs/2402.00559
Website: https://reveal-dataset.github.io/
Abstract:
Prompting language models to provide step-by-step answers (e.g., "Chain-of-Thought") is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance.
Recent literature discusses automatic methods to verify reasoning steps to evaluate and improve their correctness. However, no fine-grained step-level datasets are available to enable thorough evaluation of such verification methods, hindering progress in this direction.
We introduce Reveal: *Reasoning Verification Evaluation*, a new dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning in open-domain question answering settings.
Reveal includes comprehensive labels for the relevance, attribution to evidence passages, and logical correctness of each reasoning step in a language model's answer, across a wide variety of datasets and state-of-the-art language models.
### Usage
To load the dataset:
```python
# Install the library first (shell: `pip install datasets`; notebook: `!pip install datasets`)
from datasets import load_dataset
reveal = load_dataset("google/reveal")
reveal_eval = reveal['eval'] # select Reveal-Eval, the evaluation split
reveal_open = reveal['open'] # select Reveal-Open, the hard-cases split with low-confidence annotations
```
**Note: The code above loads the table from `eval/reveal_eval.csv`, which is convenient for working with the data at scale. There is also `eval/reveal_eval.json`, with a more intuitive JSON structure, if you prefer that format.**
Some examples of how to handle the data by deriving step-level tasks:
```python
import pandas as pd
reveal_eval = pd.DataFrame(reveal_eval)
# Step Attribution task
eval_attr = reveal_eval[~reveal_eval.evidence.isna()].reset_index(drop=True)
eval_attr['decontextualized_step'] = eval_attr['decontextualized_step'].fillna(eval_attr['step'])
# Fields:
# Premise: [evidence]
# Hypothesis: [decontextualized_step]
# Gold label: [attribution_label]
# Step Logic task
def _make_history(row):
    # History = the question plus the part of the full answer that precedes this step.
    return row['question'] + ' ' + row['full_answer'].split(row['step'].strip())[0]
eval_logic = reveal_eval.drop_duplicates(subset=['answer_id', 'step_idx']).reset_index(drop=True)
eval_logic = eval_logic[(eval_logic['type_label'] == 'Logical step.') & (eval_logic['logic_relevance_label'] == 'Relevant') & (~eval_logic['correctness_label'].isna())]
eval_logic['history'] = eval_logic.apply(_make_history, axis=1)
# Fields:
# Premise: [history]
# Hypothesis: [step]
# Gold label: [correctness_label]
# Step Relevance task
eval_relevance = reveal_eval.drop_duplicates(subset=['answer_id', 'step_idx']).reset_index(drop=True)
eval_relevance['relevance_label'] = (eval_relevance['logic_relevance_label'] == 'Relevant') | (eval_relevance['attribution_relevance_label'] == 'Yes')
# Fields:
# Question: [question]
# Answer: [full_answer]
# Step: [step]
# Gold label: [relevance_label]
# Step Type task
eval_type = reveal_eval.drop_duplicates(subset=['answer_id', 'step_idx']).reset_index(drop=True)
# Fields:
# Question: [question]
# Answer: [full_answer]
# Step: [step]
# Gold label: [type_label]
# CoT Full Correctness task
# Get a list of the final rated evidence passages for each answer_id and concatenate the list into one string:
rated_evidence_per_answer = {
    answer_id: reveal_eval[(reveal_eval.answer_id == answer_id) & reveal_eval.is_final_rated_evidence_for_step]['evidence']
    for answer_id in reveal_eval['answer_id'].unique()
}
rated_evidence_per_answer = {
    k: '\n'.join([f'Evidence {i+1}: {e}' for i, e in enumerate(v)])
    for k, v in rated_evidence_per_answer.items()
}
# Prepare the eval DataFrame:
answer_correctness_eval = reveal_eval.drop_duplicates(subset=['answer_id']).reset_index(drop=True)
answer_correctness_eval['all_rated_evidence'] = answer_correctness_eval['answer_id'].apply(lambda x: rated_evidence_per_answer[x])
answer_correctness_eval = answer_correctness_eval[['answer_id','question','full_answer','all_rated_evidence','answer_is_fully_attributable','answer_is_logically_correct','answer_is_fully_attributable_and_correct']]
```
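Once a step-level task has been derived, scoring a verifier comes down to comparing its predictions with the gold column. The following is a minimal sketch of one way to do that, not the paper's evaluation protocol. The `score_verifier` helper, the `always_supported` baseline, the toy rows, and the label strings `"supported"` / `"not_supported"` are all hypothetical placeholders; substitute the actual label values found in the dataset.

```python
def score_verifier(premises, hypotheses, gold_labels, predict):
    """Return accuracy and macro-F1 (averaged over gold labels) of `predict`."""
    gold = list(gold_labels)
    preds = [predict(p, h) for p, h in zip(premises, hypotheses)]
    accuracy = sum(p == g for p, g in zip(preds, gold)) / len(gold)
    f1_scores = []
    for label in sorted(set(gold)):
        tp = sum(p == label and g == label for p, g in zip(preds, gold))
        fp = sum(p == label and g != label for p, g in zip(preds, gold))
        fn = sum(p != label and g == label for p, g in zip(preds, gold))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return {"accuracy": accuracy, "macro_f1": sum(f1_scores) / len(f1_scores)}


# Toy stand-ins for eval_attr['evidence'], eval_attr['decontextualized_step']
# and eval_attr['attribution_label']; the label strings are made up.
toy_evidence = ["Paris is the capital of France.", "The Nile is in Africa.", "Cats are mammals."]
toy_steps = ["The capital of France is Paris.", "The Nile is in Asia.", "Cats are mammals."]
toy_gold = ["supported", "not_supported", "supported"]


def always_supported(premise, hypothesis):
    # Hypothetical trivial baseline: ignores its inputs.
    return "supported"


result = score_verifier(toy_evidence, toy_steps, toy_gold, always_supported)
print(result)
```

The helper accepts any sequences, so pandas columns such as `eval_attr['evidence']` can be passed directly. Macro-F1 is reported alongside accuracy because a constant baseline can look strong on accuracy when the labels are imbalanced.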
### **This is an evaluation benchmark. It should not be included in training data for NLP models.**
Please do not redistribute any part of the dataset without sufficient protection against web-crawlers.
A 64-character identifier string is added to each instance in the dataset to assist in future detection of contamination in web-crawl corpora.
The reveal dataset's string is: `Reveal:Mn12GAs2I3S0eWjbTUFC0Y51ijGFB7rGBLnzGGhCQ7OtJPfVg7e6qt9zb5RPL36U`
The same has been done to the few-shot prompting demonstrations, to detect whether these demonstrations have been in a model's training data (if so, these demonstrations should not be used for few-shot evaluation of that model).
The few-shot demonstrations' string is: `Reveal:HlyeWxw8BRcQ2dPGShTUUjn03uULZOyeNbzKzRIg4QihZ45k1lrye46OoUzi3kkW`
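A contamination check with these identifiers can be as simple as an exact substring search over candidate training documents. The sketch below assumes the corpus is available as an iterable of strings; the function name `find_reveal_canaries` and the toy corpus are illustrative, and only the two identifier strings come from this card.

```python
REVEAL_CANARIES = {
    "dataset": "Reveal:Mn12GAs2I3S0eWjbTUFC0Y51ijGFB7rGBLnzGGhCQ7OtJPfVg7e6qt9zb5RPL36U",
    "few_shot_demonstrations": "Reveal:HlyeWxw8BRcQ2dPGShTUUjn03uULZOyeNbzKzRIg4QihZ45k1lrye46OoUzi3kkW",
}


def find_reveal_canaries(documents):
    """Yield (document_index, canary_name) for each document containing an identifier."""
    for index, text in enumerate(documents):
        for name, canary in REVEAL_CANARIES.items():
            if canary in text:
                yield index, name


# Toy corpus: the second document quotes a dataset instance verbatim.
corpus = [
    "An unrelated web page.",
    "A crawled page that quotes the benchmark: " + REVEAL_CANARIES["dataset"],
]
hits = list(find_reveal_canaries(corpus))
print(hits)
```

An exact match only catches verbatim copies; text that was paraphrased or stripped of the identifier would need a different detection method.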
#### Fields and Descriptions
* **dataset**: Source dataset
* **question_id**: ID of the original instance from the source dataset
* **question**: The question text
* **answer_model**: Model which generated the CoT answer
* **answer_id**: ID of a particular model's answer to a question (question_id + answer_model)
* **step_idx**: Step index in the answer for this row
* **full_answer**: Full CoT answer generated by the model
* **step**: The step from the full CoT answer which matches "step_idx", the subject of the row
* **decontextualized_step**: The decontextualized version of the step that we used for evidence retrieval (and for the NLI classification evaluation settings)
* **attribution_relevance_label**: Majority label for the relevance annotations in the attribution task
* **attribution_relevance_majority**: Max # of raters which agreed with each other for this rating
* **attribution_relevance_annotations**: The annotations for each rater (ordered list)
* **attribution_relevance_raters**: The raters (ordered list)
* **attribution_relevance_num_ratings**: The number of raters/ratings
* **evidence_id**: The evidence id (from 1 to 3) used for the annotation in this row
* **evidence**: The evidence used for the annotation in this row
* **attribution_label**: The majority label for whether the evidence supports the step
* **attribution_majority**: Max # of raters which agreed with each other for this rating
* **attribution_annotations**: The annotations for each rater (ordered list)
* **attribution_raters**: The raters (ordered list)
* **attribution_num_ratings**: The number of raters/ratings
* **attribution_justifications**: The justifications of each rater (ordered list) - note that the raters gave one justification for every step, *not* for every evidence
* **annotated_in_attribution_batch**: Which batch this was annotated in (we had 5 annotation batches)
* **type_label**: Majority label for whether the step is an attribution step, logical step or both
* **type_majority**: Max # of raters which agreed with each other for this rating
* **type_annotations**: The annotations for each rater (ordered list)
* **type_raters**: The raters (ordered list)
* **type_num_ratings**: The number of raters/ratings
* **logic_relevance_label**: Majority label for relevance annotations in the logic task
* **logic_relevance_majority**: Max # of raters which agreed with each other for this rating
* **logic_relevance_annotations**: The annotations for each rater (ordered list)
* **logic_relevance_raters**: The raters (ordered list)
* **logic_relevance_num_ratings**: The number of raters/ratings
* **logic_justifications**: Justifications of each rater (ordered list) - note that the raters gave one justification to all ratings of every step (i.e., one justification for the ratings of type + relevance + correctness together)
* **annotated_in_logic_batch**: Which batch this was annotated in (we had 5 annotation batches)
* **correctness_label**: Majority label for whether the step is logically correct given the question + previous steps
* **correctness_majority**: Max # of raters which agreed with each other for this rating
* **correctness_annotations**: The annotations for each rater (ordered list)
* **correctness_raters**: The raters (ordered list)
* **correctness_num_ratings**: The number of raters/ratings
* **agreement_majority_all_steps**: Minimum agreement majority across the attribution and logic ratings for all steps
* **is_low_agreement_hard_case**: agreement_majority_all_steps <= 2. This boolean indicates whether the annotations for this answer contain a step with non-trustworthy annotations. This is the difference between Reveal-Eval and Reveal-Open.
* **contamination_identifier**: An identification string for contamination detection.
* **is_final_rated_evidence_for_step**: Whether this step-evidence pair is the final attribution rating for this step. We try up to 3 evidence passages and stop when we find a supporting or contradicting one, so when this flag is true the rating in this row is the final attribution rating for the step across all evidence passages.
* **answer_is_fully_attributable**: Whether all attribution steps in the answer are fully attributable to some evidence
* **answer_is_logically_correct**: Whether all logic steps are logically correct
* **answer_is_fully_attributable_and_correct**: Whether all steps are correct (fully attributable or logical)
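The `*_majority` fields and `is_low_agreement_hard_case` relate to the per-rater annotation lists in a simple way, which the sketch below illustrates. It is an assumption-laden illustration, not the authors' aggregation code: the helper names, the number of raters, and the toy annotation values are made up, and the card does not state how ties between labels are broken (here the label seen first wins).

```python
from collections import Counter


def majority(annotations):
    """Return (most common label, number of raters who chose it)."""
    label, count = Counter(annotations).most_common(1)[0]
    return label, count


def is_low_agreement_hard_case(step_majorities, threshold=2):
    """True when the weakest agreement across all ratings of an answer is <= threshold."""
    return min(step_majorities) <= threshold


# Toy ratings for one answer (labels and rater counts are hypothetical).
ratings = [
    ["Yes", "Yes", "Yes", "Yes", "No"],
    ["Relevant", "Relevant", "Not relevant", "Relevant", "Not relevant"],
    ["A", "B", "A", "B", "C"],
]
majorities = [majority(r) for r in ratings]
agreement_majority_all_steps = min(count for _, count in majorities)
hard_case = is_low_agreement_hard_case([count for _, count in majorities])
print(majorities, agreement_majority_all_steps, hard_case)
```

In this toy answer the third rating has only two agreeing raters, so the answer as a whole would be flagged as a hard case and fall into Reveal-Open instead of Reveal-Eval.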
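Provider: maas
Created: 2025-04-21
Background overview: Reveal is a benchmark dataset for evaluating verifiers of reasoning chains. It provides detailed labels for each reasoning step, covering relevance, attribution to evidence, and logical correctness. It targets open-domain question answering and aims to support research on, and evaluation of, automatic verification methods. (Summary compiled by 遇见数据集.)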



