ProverQA
收藏魔搭社区2026-01-02 更新2025-11-03 收录
下载链接:
https://modelscope.cn/datasets/OpenDataLab/ProverQA
下载链接
链接失效反馈官方服务:
资源简介:
# ProverQA: A First-Order Logic Reasoning Dataset
This repository contains the dataset described in the paper [Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation](https://openreview.net/forum?id=C25SgeXWjE).
Code: https://github.com/opendatalab/ProverGen
## Dataset Description
ProverQA is a high-quality First-Order Logic (FOL) reasoning dataset created using the ProverGen framework, which synergizes the generative strengths of Large Language Models (LLMs) with the rigor and precision of symbolic provers. The dataset is designed to evaluate and improve the logical reasoning capabilities of language models, particularly in chain-of-thought (CoT) contexts.
### Key Features
- **Scalable**: Generated using an automated framework enabling expansion with minimal manual intervention
- **Natural and Diverse Language**: Captures a wide range of natural language expressions to reflect real-world linguistic variability
- **Symbolic Representations**: Includes formal symbolic structures validated through automated symbolic provers (Prover9)
- **Faithful Reasoning Chains**: Each instance includes intermediate reasoning steps clearly articulated in both symbolic and natural language formats
## Dataset Structure
The dataset consists of two main parts:
### 1. Development Set (`dev/`)
The benchmark evaluation set containing **1,500 instances** across three difficulty levels:
- **Easy**: 500 instances (1-2 reasoning steps)
- **Medium**: 500 instances (3-5 reasoning steps)
- **Hard**: 500 instances (6-9 reasoning steps)
### 2. Training Set (`train/`)
Contains **5,000 instances** used for finetuning experiments, generated using the same ProverGen framework with data augmentation techniques.
## Data Format
### Development Set Example (`dev/easy.json`)
```json
{
"id": 0,
"options": [
"A) True",
"B) False",
"C) Uncertain"
],
"answer": "B",
"question": "Based on the above information, is the following statement true, false, or uncertain? Brecken has never experienced heartbreak.",
"reasoning": "fact1: Brecken has experienced heartbreak.\nrule: Either Brecken has experienced heartbreak or he has never experienced heartbreak, but not both.\nconclusion: Brecken has experienced heartbreak.\n\nTherefore, it is false that Brecken has never experienced heartbreak. The correct option is: B.",
"context": "Brecken has experienced heartbreak. Either Brecken has experienced heartbreak or he has never experienced heartbreak, but not both.",
"nl2fol": {
"Brecken has experienced heartbreak.": "has_experienced_heartbreak(Brecken)",
"Either Brecken has experienced heartbreak or he has never experienced heartbreak, but not both.": "has_experienced_heartbreak(Brecken) ⊕ has_never_experienced_heartbreak(Brecken)"
},
"conclusion_fol": "has_never_experienced_heartbreak(Brecken)"
}
```
**Field Descriptions:**
- `id`: Unique identifier for the instance
- `options`: Multiple choice options (True/False/Uncertain)
- `answer`: Correct answer (A/B/C)
- `question`: The reasoning question to be answered
- `reasoning`: Step-by-step logical reasoning chain
- `context`: Background premises and rules
- `nl2fol`: Natural language to first-order logic mappings
- `conclusion_fol`: The conclusion in first-order logic format
### Training Set Example (`train/provergen-5000.json`)
```json
{
"system": "Given a problem statement as contexts, the task is to answer a logical reasoning question. Your answer should be in JSON format with keys: reasoning, answer.",
"output": "{\n \"reasoning\": \"Jayce pursues historical interests. Jayce does not develop expertise. If Jayce pursues historical interests, then he either gains practical skills or develops expertise (or both). Jayce gains practical skills...\",\n \"answer\": \"C\"\n}",
"input": "The correct option is:",
"instruction": "Context:\nJayce earns academic credentials. If someone earns academic credentials and gains practical skills, then they can participate in excavations...\n\nQuestion: Based on the above information, is the following statement true, false, or uncertain? If Jayce studies Seljuk history or explores Anatolian landscapes, then he uncovers hidden archaeological treasures.\n\nOptions:\nA) True\nB) False\nC) Uncertain\n"
}
```
**Field Descriptions:**
- `system`: System prompt for the model
- `instruction`: Full problem context, question, and options
- `input`: Input prompt continuation
- `output`: Expected model response with reasoning and answer
## FOL Coverage
The dataset covers all seven First-Order Logic relationships:
- **Conjunction** (∧): AND operations
- **Disjunction** (∨): OR operations
- **Negation** (¬): NOT operations
- **Implication** (→): IF-THEN relationships
- **Exclusive Disjunction** (⊕): XOR operations
- **Universal Quantifier** (∀): FOR ALL statements
- **Existential Quantifier** (∃): THERE EXISTS statements
## Usage
### Loading the Dataset
```python
from datasets import load_dataset
# Load development set
dev_dataset = load_dataset("opendatalab/ProverQA", data_files="dev/*.json")
# Load training set
train_dataset = load_dataset("opendatalab/ProverQA", data_files="train/provergen-5000.json")
# Load specific difficulty level
easy_dev = load_dataset("opendatalab/ProverQA", data_files="dev/easy.json")
medium_dev = load_dataset("opendatalab/ProverQA", data_files="dev/medium.json")
hard_dev = load_dataset("opendatalab/ProverQA", data_files="dev/hard.json")
```
### Evaluation
The dataset supports both standard prompting and chain-of-thought (CoT) prompting evaluation strategies. Models are evaluated on their ability to:
1. Parse complex logical premises and rules
2. Perform multi-step reasoning
3. Handle distracting information
4. Determine whether conclusions are True, False, or Uncertain
### Training
The training set can be used for:
- Finetuning language models on FOL reasoning
- Improving chain-of-thought reasoning capabilities
- Enhancing logical inference abilities
## Benchmark Results
State-of-the-art models show significant room for improvement on ProverQA:
| Model | Easy | Medium | Hard |
|-------|------|--------|------|
| GPT-4o (CoT) | 94.2% | 79.4% | 50.0% |
| Claude-3.5-Sonnet (CoT) | 95.2% | 83.6% | 56.4% |
| Llama3.1-70B (CoT) | 90.4% | 73.2% | 46.8% |
| o1-preview-2024-09-12 | 89.8% | 78.8% | 66.2% |
| DeepSeek-R1 | 91.8% | 78.4% | 66.6% |
Even with chain-of-thought prompting, top models barely exceed 50% accuracy on the hard subset, demonstrating the challenging nature of this benchmark.
## Dataset Creation
ProverQA was created using the ProverGen framework through a three-stage process:
1. **Background Story Generation**: LLMs generate unique contextual stories guided by subject names and characteristic keywords
2. **Logic Skeleton Generation**: Symbolic provers (Prover9) construct reasoning trees using a novel top-down approach
3. **Statement Translation**: LLMs translate logical expressions into natural language within the established context
## Citation
If you use ProverQA in your research, please cite:
```bibtex
@inproceedings{
qi2025large,
title={Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation},
author={Chengwen Qi and Ren Ma and Bowen Li and He Du and Binyuan Hui and Jinwang Wu and Yuanjun Laili and Conghui He},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=C25SgeXWjE}
}
```
## License and Ethics
- Dataset sources comply with public repository licenses (MIT for names, WordNet for keywords)
- No human participants involved in data collection
- Dataset does not contain harmful or biased content
- Designed for academic and research purposes in advancing logical reasoning capabilities
## Contact
For questions or issues regarding the dataset, please contact the authors or open an issue in the GitHub repository.
- Chengwen Qi (chengwen_qi@buaa.edu.cn)
- Ren Ma (maren@pjlab.org.cn)
- Bowen Li (libowen@pjlab.org.cn)
# ProverQA:一阶逻辑(First-Order Logic,FOL)推理数据集
本仓库包含论文《大语言模型与符号推理器结合用于逻辑推理评估》(Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation)所描述的数据集。
代码仓库:https://github.com/opendatalab/ProverGen
## 数据集概述
ProverQA是一款高质量的一阶逻辑推理数据集,基于ProverGen框架构建。该框架协同融合了大语言模型(Large Language Model,LLM)的生成能力与符号推理器的严谨性与精确性。本数据集旨在评估并提升语言模型的逻辑推理能力,尤其适用于思维链(Chain-of-Thought,CoT)场景下的测试。
### 核心特性
- **可扩展性**:采用自动化框架生成,仅需极少人工干预即可实现数据集扩展
- **语言自然多样**:涵盖丰富的自然语言表达形式,贴合真实世界的语言多样性
- **符号化表示**:包含经自动化符号推理器(Prover9)验证的形式化符号结构
- **推理链忠实可靠**:每个样本均包含以符号语言与自然语言两种形式清晰表述的中间推理步骤
## 数据集结构
数据集包含两大核心部分:
### 1. 开发集(`dev/`)
该基准评估集包含1500个样本,分为三个难度等级:
- **简单**:500个样本(需1-2步推理)
- **中等**:500个样本(需3-5步推理)
- **困难**:500个样本(需6-9步推理)
### 2. 训练集(`train/`)
包含5000个样本,用于微调实验,基于ProverGen框架结合数据增强技术生成。
## 数据格式
### 开发集示例(`dev/easy.json`)
json
{"id": 0, "options": ["A) 真", "B) 假", "C) 不确定"], "answer": "B", "question": "根据上述信息,以下陈述为真、假还是不确定?布雷肯从未经历过心碎。", "reasoning": "事实1:布雷肯经历过心碎。
规则:要么布雷肯经历过心碎,要么他从未经历过心碎,二者不可兼得。
结论:布雷肯经历过心碎。
因此,布雷肯从未经历过心碎的说法为假。正确选项为:B。", "context": "布雷肯经历过心碎。要么布雷肯经历过心碎,要么他从未经历过心碎,二者不可兼得。", "nl2fol": {"布雷肯经历过心碎。": "has_experienced_heartbreak(Brecken)", "要么布雷肯经历过心碎,要么他从未经历过心碎,二者不可兼得。": "has_experienced_heartbreak(Brecken) ⊕ has_never_experienced_heartbreak(Brecken)"}, "conclusion_fol": "has_never_experienced_heartbreak(Brecken)"}
**字段说明**:
- `id`:样本的唯一标识符
- `options`:多项选择选项(真/假/不确定)
- `answer`:正确答案(A/B/C)
- `question`:待解答的推理问题
- `reasoning`:逐步逻辑推理链
- `context`:背景前提与规则
- `nl2fol`:自然语言到一阶逻辑的映射关系
- `conclusion_fol`:以一阶逻辑格式表示的结论
### 训练集示例(`train/provergen-5000.json`)
json
{"system": "给定包含上下文的问题陈述,任务是回答逻辑推理问题。你的输出需为JSON格式,包含reasoning与answer两个键。", "output": "{"reasoning": "杰斯热衷于历史研究。杰斯未获得专业技能。如果杰斯热衷于历史研究,那么他要么获得实用技能,要么获得专业技能(或两者兼具)。杰斯获得了实用技能...", "answer": "C"}", "input": "正确选项为:", "instruction": "背景:
杰斯获得了学术资质。如果某人获得学术资质并掌握实用技能,那么他们可以参与考古发掘...
问题:根据上述信息,以下陈述为真、假还是不确定?如果杰斯研究塞尔柱历史或探索安纳托利亚地貌,那么他会发现隐藏的考古宝藏。
选项:
A) 真
B) 假
C) 不确定
"}
**字段说明**:
- `system`:面向模型的系统提示词
- `instruction`:完整的问题背景、提问与选项
- `input`:输入提示的后续部分
- `output`:模型需输出的包含推理过程与答案的结果
## 一阶逻辑覆盖范围
本数据集涵盖全部7类一阶逻辑关系:
- **合取(Conjunction,∧)**:与(AND)操作
- **析取(Disjunction,∨)**:或(OR)操作
- **否定(Negation,¬)**:非(NOT)操作
- **蕴含(Implication,→)**:如果-那么(IF-THEN)关系
- **互斥析取(Exclusive Disjunction,⊕)**:异或(XOR)操作
- **全称量词(Universal Quantifier,∀)**:所有(FOR ALL)语句
- **存在量词(Existential Quantifier,∃)**:存在(THERE EXISTS)语句
## 使用方式
### 加载数据集
python
from datasets import load_dataset
# 加载开发集
dev_dataset = load_dataset("opendatalab/ProverQA", data_files="dev/*.json")
# 加载训练集
train_dataset = load_dataset("opendatalab/ProverQA", data_files="train/provergen-5000.json")
# 加载特定难度等级的子集
easy_dev = load_dataset("opendatalab/ProverQA", data_files="dev/easy.json")
medium_dev = load_dataset("opendatalab/ProverQA", data_files="dev/medium.json")
hard_dev = load_dataset("opendatalab/ProverQA", data_files="dev/hard.json")
### 模型评估
本数据集支持标准提示与思维链(Chain-of-Thought,CoT)提示两种评估策略。模型的评估维度包括:
1. 解析复杂的逻辑前提与规则
2. 完成多步推理
3. 处理干扰信息
4. 判断结论为真、假或不确定
### 模型训练
训练集可用于以下场景:
1. 针对一阶逻辑推理任务微调语言模型
2. 提升思维链推理能力
3. 增强逻辑推断能力
## 基准测试结果
顶尖模型在ProverQA上仍有显著的提升空间:
| 模型 | 简单 | 中等 | 困难 |
|-------|------|--------|------|
| GPT-4o (CoT) | 94.2% | 79.4% | 50.0% |
| Claude-3.5-Sonnet (CoT) | 95.2% | 83.6% | 56.4% |
| Llama3.1-70B (CoT) | 90.4% | 73.2% | 46.8% |
| o1-preview-2024-09-12 | 89.8% | 78.8% | 66.2% |
| DeepSeek-R1 | 91.8% | 78.4% | 66.6% |
即便采用思维链提示,顶尖模型在困难子集上的准确率也仅勉强超过50%,凸显了该基准测试的挑战性。
## 数据集构建流程
ProverQA基于ProverGen框架,通过三阶段流程构建:
1. **背景故事生成**:大语言模型基于主题名称与特征关键词生成独特的上下文故事
2. **逻辑骨架生成**:符号推理器(Prover9)采用新颖的自顶向下方法构建推理树
3. **语句翻译**:大语言模型将逻辑表达式转换为对应上下文下的自然语言表述
## 引用方式
若您在研究中使用ProverQA,请引用以下文献:
bibtex
@inproceedings{
qi2025large,
title={Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation},
author={Chengwen Qi and Ren Ma and Bowen Li and He Du and Binyuan Hui and Jinwang Wu and Yuanjun Laili and Conghui He},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=C25SgeXWjE}
}
## 许可与伦理说明
- 数据集来源符合公开仓库的开源协议(名称相关内容采用MIT协议,关键词来自WordNet)
- 数据收集过程未涉及人类受试者
- 数据集不包含有害或偏见性内容
- 本数据集专为推进逻辑推理能力研究的学术用途设计
## 联系方式
若您对数据集有任何疑问或问题,请联系作者或在GitHub仓库中提交Issue。
- 程文齐(chengwen_qi@buaa.edu.cn)
- 马仁(maren@pjlab.org.cn)
- 李博文(libowen@pjlab.org.cn)
提供机构:
maas
创建时间:
2025-11-26



