
crv

ModelScope Community, updated 2025-12-05, indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/facebook/crv
Resource summary:
This repository contains the datasets used in the paper **"Verifying Chain-of-Thought Reasoning via its Computational Graph"** (CRV). The data consists of reasoning and math problems, the Chain-of-Thought (CoT) traces generated by **Llama 3.1 8B Instruct**, and the step-level correctness annotations used to train and evaluate the CRV verifier.

**Paper:** [arXiv:2510.09312](https://arxiv.org/abs/2510.09312)

**Codebase:** [GitHub Repository](https://github.com/facebookresearch/CRV)

## Dataset Overview

We provide data for three domains. For the arithmetic and boolean domains, the data is organized into files by complexity (number of operators).

### Domains

1. **Synthetic Arithmetic:** Nested integer expressions involving addition, multiplication, and unary minus operators.
2. **Synthetic Boolean:** Nested logical expressions over the truth values `True` and `False` involving the boolean operators `and`, `or`, and `not`.
3. **GSM8K:** The test split of the GSM8K benchmark, with generated CoT traces.

### Data Types

For each domain, we provide two types of files:

1. **Raw CoT Generations:** Contain the original expression/question and the CoT response generated by Llama 3.1 8B Instruct. The CoTs are pre-segmented into individual reasoning steps.
2. **Annotated Data:** Contain the same CoTs with **step-level correctness labels** (Correct/Incorrect). These labels were generated with the consensus pipeline described in the paper (LLM-as-a-Judge + Programmatic Verification).

**Note on Dataset Size:** The annotated datasets for the synthetic tasks (e.g., `arith.nt7.annotated.json`) contain fewer samples than the corresponding raw datasets (e.g., `arith.nd10000.nt7.json`). This is intentional. To ensure high-quality, reliable labels, we used a dual-verification strategy combining an LLM-as-a-Judge and a Programmatic Verifier, and applied a strict intersection policy: only reasoning steps on which both methods agreed on the correctness label were retained.
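The intersection policy can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each method's output is a dictionary mapping `(expression_id, step_number)` to a boolean correctness label; the function name `intersect_labels` is illustrative and does not come from the CRV codebase, whose actual pipeline may be structured differently.

```python
from typing import Dict, Tuple

# Hypothetical representation (not the CRV codebase's API):
# (expression_id, step_number) -> correctness label (True = correct).
StepKey = Tuple[int, int]


def intersect_labels(
    judge_labels: Dict[StepKey, bool],
    verifier_labels: Dict[StepKey, bool],
) -> Dict[StepKey, bool]:
    """Keep only steps that both methods labeled, and labeled identically.

    Steps missing from either source, or on which the two sources
    disagree, are dropped, so the result can be smaller than either input.
    """
    return {
        key: label
        for key, label in judge_labels.items()
        if key in verifier_labels and verifier_labels[key] == label
    }


judge = {(0, 0): True, (0, 1): False, (1, 0): True}
verifier = {(0, 0): True, (0, 1): True, (1, 0): True}
print(intersect_labels(judge, verifier))  # {(0, 0): True, (1, 0): True}
```

Here step `(0, 1)` is discarded because the judge marks it incorrect while the verifier marks it correct.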
Refer to **Appendix A.2** of the paper for details of the dual-verification procedure.

---

## File Structure and Naming Convention

The files are located in their respective folders (`arithmetic_expressions/`, `boolean_expressions/`, `gsm8k_expressions/`). The filenames follow a convention that indicates the complexity and size of each dataset.

**Format:** `[domain].nd[count].nt[operators].[type].json`

* **`nd` (Number of Dialogs):** The number of expressions/samples in the file (e.g., `nd10000` = 10,000 samples).
* **`nt` (Non-Terminal Nodes):** The complexity of the expression, defined by the number of operators (e.g., `nt7` = 7 operators).
* **`.annotated`:** If present, the file contains step-level labels.

Not every part appears in every filename: `arith.nt7.annotated.json` has no `nd[count]` part, and `arith.nd10000.nt7.json` has no `[type]` part.

### Examples

* **`arith.nd10000.nt7.json`**:
  * **Domain:** Arithmetic
  * **Size:** 10,000 samples
  * **Complexity:** 7 operators
  * **Content:** Questions and segmented CoTs (unlabeled).
* **`arith.nt7.annotated.json`**:
  * **Domain:** Arithmetic
  * **Complexity:** 7 operators
  * **Content:** Questions, segmented CoTs, and **step-level correctness labels**.

---

## Data Fields

The dataset schema differs slightly between the **Raw CoT Generations** (unlabeled) and the **Annotated Data** (labeled).

**1. Raw CoT Generations** *(e.g., `arith.nd10000.nt7.json`)*

These files contain the generated CoT traces segmented into steps, but without correctness labels.

- `role` (string): The role of the message sender (e.g., `"assistant"`).
- `content` (string): The full, unsegmented Chain-of-Thought text generated by the model.
- `predicted_truth_value`/`predicted_value` (boolean/number): The final answer derived from the generated CoT.
- `step_level` (list): A list of objects representing the segmented steps:
  - `step_number` (int): The index of the reasoning step (0-indexed).
  - `content` (string): The text content of that reasoning step.

**2. Annotated Data** *(e.g., `arith.nt7.annotated.json`)*

These files contain the full context required to replicate the attribution process, including formatted prompts with special tokens and step-level correctness labels.

- `expression_id` (int): A unique identifier for the problem/expression.
- `original_expression` (string): The input problem text (e.g., the math problem or the boolean/arithmetic expression).
- `correct_value` (boolean/number): The ground-truth answer.
- `predicted_value` (boolean/number): The answer predicted by the model in this specific trace.
- `total_steps` (int): The total number of reasoning steps in the solution.
- `step_expressions` (list): A list of objects containing detailed annotations for each step:
  - `step_number` (int): The index of the step.
  - `step_content` (string): The text content of the current step.
  - `step_label` (boolean): The correctness label (`true` for Correct, `false` for Incorrect).
  - `assistant_content_before` (string): The cumulative CoT text **preceding** the current step.
  - `assistant_content_after` (string): The cumulative CoT text **including** the current step.
  - `formatted_assistant_content_before` (string): The exact prompt string used to generate the current step, including Llama 3 special tokens (e.g., `<|start_header_id|>`, `<|eot_id|>`). This is critical for exact state reconstruction during attribution.
  - `formatted_assistant_content_after` (string): The exact prompt string representing the state after the step is generated.

---

## Citation

If you use this dataset, please cite our paper:

```bibtex
@article{zhao2025verifying,
  title={Verifying Chain-of-Thought Reasoning via Its Computational Graph},
  author={Zheng Zhao and Yeskendir Koishekenov and Xianjun Yang and Naila Murray and Nicola Cancedda},
  year={2025},
  eprint={2510.09312},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.09312},
}
```
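To work with the files described under Data Fields, the following standard-library Python sketch may help. It is not part of the CRV codebase. It assumes that each `.json` file holds a JSON array of records with the documented fields; that assumption is unverified, and the helper names `parse_filename`, `first_incorrect_step`, `label_counts`, and `load_annotated` are hypothetical. The filename regex treats the `nd`, `nt`, and `.annotated` parts as optional so that it matches both example filenames.

```python
import json
import re
from typing import Any, Dict, List, Optional

# Hypothetical regex for the documented pattern [domain].nd[count].nt[operators].[type].json.
# The nd/nt/type parts are optional, since e.g. arith.nt7.annotated.json has no nd part.
FILENAME_RE = re.compile(
    r"^(?P<domain>[^.]+)"
    r"(?:\.nd(?P<count>\d+))?"
    r"(?:\.nt(?P<operators>\d+))?"
    r"(?:\.(?P<type>annotated))?"
    r"\.json$"
)


def parse_filename(name: str) -> Dict[str, Any]:
    """Split a dataset filename into domain, sample count, operator count, and label flag."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized filename: {name}")
    return {
        "domain": m["domain"],
        "count": int(m["count"]) if m["count"] else None,
        "operators": int(m["operators"]) if m["operators"] else None,
        "annotated": m["type"] == "annotated",
    }


def load_annotated(path: str) -> List[Dict[str, Any]]:
    """Load an annotated file, ASSUMING it is a JSON array of records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def first_incorrect_step(record: Dict[str, Any]) -> Optional[int]:
    """Return the step_number of the earliest step labeled false, or None if all are true."""
    steps = sorted(record["step_expressions"], key=lambda s: s["step_number"])
    for step in steps:
        if not step["step_label"]:
            return step["step_number"]
    return None


def label_counts(records: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count correct vs. incorrect steps across a list of annotated records."""
    counts = {"correct": 0, "incorrect": 0}
    for record in records:
        for step in record["step_expressions"]:
            counts["correct" if step["step_label"] else "incorrect"] += 1
    return counts


# Usage with a hand-written record (not real dataset content).
record = {
    "expression_id": 0,
    "step_expressions": [
        {"step_number": 0, "step_label": True},
        {"step_number": 1, "step_label": False},
    ],
}
print(parse_filename("arith.nt7.annotated.json"))
# {'domain': 'arith', 'count': None, 'operators': 7, 'annotated': True}
print(first_incorrect_step(record))  # 1
print(label_counts([record]))  # {'correct': 1, 'incorrect': 1}
```

Sorting by `step_number` inside `first_incorrect_step` avoids relying on the list already being in step order.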

Provided by:
maas
Created:
2025-11-29