crv
Source: ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-12-06.
Download link: https://modelscope.cn/datasets/facebook/crv
This repository contains the datasets used in the paper **"Verifying Chain-of-Thought Reasoning via its Computational Graph"** (CRV).
The data consists of reasoning and math problems, the Chain-of-Thought (CoT) traces generated by **Llama 3.1 8B Instruct**, and step-level correctness annotations used to train and evaluate the CRV verifier.
**Paper:** [arXiv:2510.09312](https://arxiv.org/abs/2510.09312)
**Codebase:** [GitHub Repository](https://github.com/facebookresearch/CRV)
## Dataset Overview
We provide data for three domains. For the arithmetic and boolean domains, the data is organized into files based on complexity (number of operators).
### Domains
1. **Synthetic Arithmetic:** Nested integer expressions involving addition, multiplication, and unary minus operators.
2. **Synthetic Boolean:** Nested logical expressions over the truth values `True` and `False` involving boolean operators (`and`, `or`, `not`).
3. **GSM8K:** The test split of the GSM8K benchmark, with generated CoT traces.
### Data Types
For each domain, we provide two types of files:
1. **Raw CoT Generations:** Contains the original expression/question and the CoT response generated by Llama 3.1 8B Instruct. The CoTs are pre-segmented into individual reasoning steps.
2. **Annotated Data:** Contains the same CoTs but with **step-level correctness labels** (Correct/Incorrect). These labels were generated using the consensus pipeline described in the paper (LLM-as-a-Judge + Programmatic Verification).
**Note on Dataset Size:**
You may notice that the annotated datasets (e.g., `arith.nt7.annotated.json`) for synthetic tasks contain fewer samples than the raw datasets (e.g., `arith.nd10000.nt7.json`). This is intentional.
To ensure high-quality, reliable labels, we employed a dual-verification strategy using both an LLM-as-a-Judge and a Programmatic Verifier. We applied a strict intersection policy: only reasoning steps where both methods agreed on the correctness label were retained. Refer to **Appendix A.2** of the paper for details.
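The intersection policy can be illustrated with a minimal sketch. The function name `intersect_labels`, the dictionary layout (step number mapped to a boolean label), and the sample labels are all assumptions made for illustration; the actual pipeline is described in the paper and may differ.

```python
from typing import Dict


def intersect_labels(
    judge: Dict[int, bool], programmatic: Dict[int, bool]
) -> Dict[int, bool]:
    """Keep only steps that both verifiers labeled, and labeled identically."""
    return {
        step: label
        for step, label in judge.items()
        if step in programmatic and programmatic[step] == label
    }


# Hypothetical labels for a four-step trace (step_number -> is_correct).
judge_labels = {0: True, 1: True, 2: False, 3: True}
program_labels = {0: True, 1: False, 2: False, 3: True}

kept = intersect_labels(judge_labels, program_labels)
print(kept)  # {0: True, 2: False, 3: True}
```

Step 1 is dropped because the two verifiers disagree on it, which is why an annotated file can end up smaller than its raw counterpart.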
---
## File Structure and Naming Convention
The files are located in their respective folders (`arithmetic_expressions/`, `boolean_expressions/`, `gsm8k_expressions/`). The filenames follow a specific convention indicating the complexity and size of the dataset.
**Format:** `[domain].nd[count].nt[operators].[type].json`
* **`nd` (Number of Dialogs):** The number of expressions/samples in the file (e.g., `nd10000` = 10,000 samples).
* **`nt` (Non-Terminal Nodes):** The complexity of the expression, defined by the number of operators (e.g., `nt7` = 7 operators).
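* **`.annotated`:** If present, the file contains step-level labels. Note that the annotated example in this document, `arith.nt7.annotated.json`, omits the `nd` component.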
### Examples
* **`arith.nd10000.nt7.json`**:
    * **Domain:** Arithmetic
    * **Size:** 10,000 samples
    * **Complexity:** 7 operators
    * **Content:** Questions and segmented CoTs (Unlabeled).
* **`arith.nt7.annotated.json`**:
    * **Domain:** Arithmetic
    * **Complexity:** 7 operators
    * **Content:** Questions, segmented CoTs, and **step-level correctness labels**.
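As a rough sketch, the naming convention can be parsed with a regular expression. The helper `parse_filename` and the `FileInfo` type are hypothetical names, and the pattern treats the `nd` and `nt` components as optional, which is an assumption based only on the two example filenames above; it has not been checked against every file in the repository.

```python
import re
from typing import NamedTuple, Optional


class FileInfo(NamedTuple):
    domain: str
    num_samples: Optional[int]    # from `nd`, None if absent
    num_operators: Optional[int]  # from `nt`, None if absent
    annotated: bool


# Assumed pattern: [domain](.nd<count>)?(.nt<operators>)?(.annotated)?.json
_PATTERN = re.compile(
    r"^(?P<domain>[A-Za-z0-9_]+)"
    r"(?:\.nd(?P<nd>\d+))?"
    r"(?:\.nt(?P<nt>\d+))?"
    r"(?P<annotated>\.annotated)?"
    r"\.json$"
)


def parse_filename(name: str) -> FileInfo:
    """Split a dataset filename into its documented components."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized filename: {name!r}")
    nd, nt = match.group("nd"), match.group("nt")
    return FileInfo(
        domain=match.group("domain"),
        num_samples=int(nd) if nd else None,
        num_operators=int(nt) if nt else None,
        annotated=match.group("annotated") is not None,
    )


print(parse_filename("arith.nd10000.nt7.json"))
print(parse_filename("arith.nt7.annotated.json"))
```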
---
## Data Fields
The dataset schema differs slightly between the **Raw CoT Generations** (unlabeled) and the **Annotated Data** (labeled).
**1. Raw CoT Generations**
*(e.g., `arith.nd10000.nt7.json`)*
These files contain the generated CoT traces segmented into steps, but without correctness labels.
- `role` (string): The role of the message sender (e.g., "assistant").
- `content` (string): The full, unsegmented Chain-of-Thought text generated by the model.
- `predicted_truth_value`/`predicted_value` (boolean/number): The final answer derived from the generated CoT.
- `step_level` (list): A list of objects representing the segmented steps.
    - `step_number` (int): The index of the reasoning step (0-indexed).
    - `content` (string): The specific text content of that reasoning step.
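A minimal sketch of reading the segmented steps from one raw message follows. The record is built in memory from the fields listed above; its values are invented, and how messages are nested at the top level of each JSON file is not described here, so loading from disk is left out. The helper name `ordered_step_texts` is hypothetical.

```python
from typing import Any, Dict, List


def ordered_step_texts(message: Dict[str, Any]) -> List[str]:
    """Return the step texts of one message, ordered by `step_number`."""
    steps = sorted(message["step_level"], key=lambda s: s["step_number"])
    return [s["content"] for s in steps]


# Hypothetical record using the documented field names; values are made up.
sample_message = {
    "role": "assistant",
    "content": "First, 2 * 3 = 6. Then, 6 + 1 = 7.",
    "predicted_value": 7,
    "step_level": [
        {"step_number": 1, "content": "Then, 6 + 1 = 7."},
        {"step_number": 0, "content": "First, 2 * 3 = 6."},
    ],
}

texts = ordered_step_texts(sample_message)
for index, text in enumerate(texts):
    print(index, text)
```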
**2. Annotated Data**
*(e.g., `arith.nt7.annotated.json`)*
These files contain the full context required to replicate the attribution process, including formatted prompts with special tokens and step-level correctness labels.
- `expression_id` (int): A unique identifier for the problem/expression.
- `original_expression` (string): The input problem text (e.g., the math problem or boolean/arithmetic expression).
- `correct_value` (boolean/number): The ground truth answer.
- `predicted_value` (boolean/number): The answer predicted by the model in this specific trace.
- `total_steps` (int): The total number of reasoning steps in the solution.
- `step_expressions` (list): A list of objects containing detailed annotations for each step:
    - `step_number` (int): The index of the step.
    - `step_content` (string): The text content of the current step.
    - `step_label` (boolean): The correctness label (`true` for Correct, `false` for Incorrect).
    - `assistant_content_before` (string): The cumulative CoT text **preceding** the current step.
    - `assistant_content_after` (string): The cumulative CoT text **including** the current step.
    - `formatted_assistant_content_before` (string): The exact prompt string used to generate the current step, including Llama 3 special tokens (e.g., `<|start_header_id|>`, `<|eot_id|>`). This is critical for exact state reconstruction during attribution.
    - `formatted_assistant_content_after` (string): The exact prompt string representing the state after the step is generated.
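The annotated fields can be used, for example, to locate the first incorrect step in a trace and to check the final answer. The sketch below works on a single hypothetical record built from the documented field names; the values are invented, the `formatted_*` and `assistant_content_*` fields are omitted for brevity, and the helper names are assumptions.

```python
from typing import Any, Dict, Optional


def first_incorrect_step(record: Dict[str, Any]) -> Optional[int]:
    """Return the `step_number` of the earliest step labeled incorrect, if any."""
    steps = sorted(record["step_expressions"], key=lambda s: s["step_number"])
    for step in steps:
        if not step["step_label"]:
            return step["step_number"]
    return None


def final_answer_matches(record: Dict[str, Any]) -> bool:
    """Compare the model's final answer with the ground truth."""
    return record["predicted_value"] == record["correct_value"]


# Hypothetical annotated record; only a subset of the documented fields is shown.
sample_record = {
    "expression_id": 0,
    "original_expression": "(2 * 3) + 1",
    "correct_value": 7,
    "predicted_value": 8,
    "total_steps": 2,
    "step_expressions": [
        {"step_number": 0, "step_content": "2 * 3 = 6", "step_label": True},
        {"step_number": 1, "step_content": "6 + 1 = 8", "step_label": False},
    ],
}

print(first_incorrect_step(sample_record))  # 1
print(final_answer_matches(sample_record))  # False
```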
---
## Citation
If you use this dataset, please cite our paper:
```bibtex
@article{zhao2025verifying,
title={Verifying Chain-of-Thought Reasoning via Its Computational Graph},
author={Zheng Zhao and Yeskendir Koishekenov and Xianjun Yang and Naila Murray and Nicola Cancedda},
year={2025},
eprint={2510.09312},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.09312},
}
```
Provider: maas
Created: 2025-11-29



