
facebook/crv

Source: Hugging Face. Updated 2025-11-28; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/facebook/crv
Description:
---
license: cc-by-nc-4.0
language:
- en
tags:
- reasoning
- math
- logic
size_categories:
- 100K<n<1M
---

This repository contains the datasets used in the paper **"Verifying Chain-of-Thought Reasoning via Its Computational Graph"** (CRV). The data consists of reasoning and math problems, the Chain-of-Thought (CoT) traces generated by **Llama 3.1 8B Instruct**, and step-level correctness annotations used to train and evaluate the CRV verifier.

**Paper:** [arXiv:2510.09312](https://arxiv.org/abs/2510.09312)

**Codebase:** [GitHub Repository](https://github.com/facebookresearch/CRV)

## Dataset Overview

We provide data for three domains. For the arithmetic and boolean domains, the data is organized into files based on complexity (number of operators).

### Domains

1. **Synthetic Arithmetic:** Nested integer expressions involving addition, multiplication, and unary minus operators.
2. **Synthetic Boolean:** Nested logical expressions over the truth values `True` and `False` involving boolean operators (`and`, `or`, `not`).
3. **GSM8K:** The test split of the GSM8K benchmark, with generated CoT traces.

### Data Types

For each domain, we provide two types of files:

1. **Raw CoT Generations:** Contains the original expression/question and the CoT response generated by Llama 3.1 8B Instruct. The CoTs are pre-segmented into individual reasoning steps.
2. **Annotated Data:** Contains the same CoTs but with **step-level correctness labels** (Correct/Incorrect). These labels were generated using the consensus pipeline described in the paper (LLM-as-a-Judge + Programmatic Verification).

**Note on Dataset Size:** You may notice that the annotated datasets for synthetic tasks (e.g., `arith.nt7.annotated.json`) contain fewer samples than the raw datasets (e.g., `arith.nd10000.nt7.json`). This is intentional. To ensure high-quality, reliable labels, we employed a dual-verification strategy using both an LLM-as-a-Judge and a Programmatic Verifier. We applied a strict intersection policy: only reasoning steps where both methods agreed on the correctness label were retained. Refer to **Appendix A.2** of the paper for details.

---

## File Structure and Naming Convention

The files are located in their respective folders (`arithmetic_expressions/`, `boolean_expressions/`, `gsm8k_expressions/`). The filenames follow a specific convention indicating the complexity and size of the dataset.

**Format:** `[domain].nd[count].nt[operators].[type].json`

* **`nd` (Number of Dialogs):** The number of expressions/samples in the file (e.g., `nd10000` = 10,000 samples).
* **`nt` (Non-Terminal Nodes):** The complexity of the expression, defined by the number of operators (e.g., `nt7` = 7 operators).
* **`.annotated`:** If present, this file contains step-level labels.

### Examples

* **`arith.nd10000.nt7.json`**:
    * **Domain:** Arithmetic
    * **Size:** 10,000 samples
    * **Complexity:** 7 operators
    * **Content:** Questions and segmented CoTs (Unlabeled).
* **`arith.nt7.annotated.json`**:
    * **Domain:** Arithmetic
    * **Complexity:** 7 operators
    * **Content:** Questions, segmented CoTs, and **step-level correctness labels**.

---

## Data Fields

The dataset schema differs slightly between the **Raw CoT Generations** (unlabeled) and the **Annotated Data** (labeled).

**1. Raw CoT Generations** *(e.g., `arith.nd10000.nt7.json`)*

These files contain the generated CoT traces segmented into steps, but without correctness labels.

- `role` (string): The role of the message sender (e.g., "assistant").
- `content` (string): The full, unsegmented Chain-of-Thought text generated by the model.
- `predicted_truth_value`/`predicted_value` (boolean/number): The final answer derived from the generated CoT.
- `step_level` (list): A list of objects representing the segmented steps.
    - `step_number` (int): The index of the reasoning step (0-indexed).
    - `content` (string): The specific text content of that reasoning step.

**2. Annotated Data** *(e.g., `arith.nt7.annotated.json`)*

These files contain the full context required to replicate the attribution process, including formatted prompts with special tokens and step-level correctness labels.

- `expression_id` (int): A unique identifier for the problem/expression.
- `original_expression` (string): The input problem text (e.g., the math problem or boolean/arithmetic expression).
- `correct_value` (boolean/number): The ground truth answer.
- `predicted_value` (boolean/number): The answer predicted by the model in this specific trace.
- `total_steps` (int): The total number of reasoning steps in the solution.
- `step_expressions` (list): A list of objects containing detailed annotations for each step:
    - `step_number` (int): The index of the step.
    - `step_content` (string): The text content of the current step.
    - `step_label` (boolean): The correctness label (`true` for Correct, `false` for Incorrect).
    - `assistant_content_before` (string): The cumulative CoT text **preceding** the current step.
    - `assistant_content_after` (string): The cumulative CoT text **including** the current step.
    - `formatted_assistant_content_before` (string): The exact prompt string used to generate the current step, including Llama 3 special tokens (e.g., `<|start_header_id|>`, `<|eot_id|>`). This is critical for exact state reconstruction during attribution.
    - `formatted_assistant_content_after` (string): The exact prompt string representing the state after the step is generated.

---

## Citation

If you use this dataset, please cite our paper:

```bibtex
@article{zhao2025verifying,
  title={Verifying Chain-of-Thought Reasoning via Its Computational Graph},
  author={Zheng Zhao and Yeskendir Koishekenov and Xianjun Yang and Naila Murray and Nicola Cancedda},
  year={2025},
  eprint={2510.09312},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.09312},
}
```
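
## Usage Sketch

As an illustration of the naming convention and the annotated schema described above, here is a minimal Python sketch. It is not part of the official CRV codebase: `DatasetFile`, `parse_filename`, `first_incorrect_step`, `label_counts`, and the sample record are hypothetical names and made-up values. The regular expression assumes that the `nd` and `.annotated` components are optional, because each of the two example filenames omits one of them.

```python
import re
from typing import NamedTuple, Optional


class DatasetFile(NamedTuple):
    """Hypothetical container for the parts of a CRV filename."""
    domain: str
    num_samples: Optional[int]    # from `nd`, absent in the annotated example
    num_operators: Optional[int]  # from `nt`
    annotated: bool               # True if `.annotated` is present


# Assumption: `nd`, `nt`, and `.annotated` may each be missing.
_PATTERN = re.compile(
    r"^(?P<domain>[A-Za-z0-9_]+)"
    r"(?:\.nd(?P<nd>\d+))?"
    r"(?:\.nt(?P<nt>\d+))?"
    r"(?P<annotated>\.annotated)?"
    r"\.json$"
)


def parse_filename(name: str) -> DatasetFile:
    """Split a filename into domain, sample count, operator count, and type."""
    m = _PATTERN.match(name)
    if m is None:
        raise ValueError(f"unrecognized filename: {name!r}")
    return DatasetFile(
        domain=m.group("domain"),
        num_samples=int(m.group("nd")) if m.group("nd") else None,
        num_operators=int(m.group("nt")) if m.group("nt") else None,
        annotated=m.group("annotated") is not None,
    )


def first_incorrect_step(record: dict) -> Optional[int]:
    """Return the `step_number` of the first step labeled incorrect, if any."""
    for step in record["step_expressions"]:
        if step["step_label"] is False:
            return step["step_number"]
    return None


def label_counts(record: dict) -> dict:
    """Count correct and incorrect step labels in one annotated record."""
    labels = [step["step_label"] for step in record["step_expressions"]]
    return {"correct": labels.count(True), "incorrect": labels.count(False)}


# Made-up record that follows the documented annotated fields; the
# `formatted_*` and `assistant_content_*` fields are omitted for brevity.
example_record = {
    "expression_id": 0,
    "original_expression": "(2 + 3) * 4",
    "correct_value": 20,
    "predicted_value": 24,
    "total_steps": 2,
    "step_expressions": [
        {"step_number": 0, "step_content": "2 + 3 = 5", "step_label": True},
        {"step_number": 1, "step_content": "5 * 4 = 24", "step_label": False},
    ],
}

if __name__ == "__main__":
    print(parse_filename("arith.nd10000.nt7.json"))
    print(parse_filename("arith.nt7.annotated.json"))
    print(first_incorrect_step(example_record), label_counts(example_record))
```

To apply the helpers to a real file, load it with `json.load` and inspect the top-level structure first, since this card does not state whether a file holds a single list of records or some other layout.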
Provider:
facebook