banach1729/goedel-workbook-lean427
收藏Hugging Face2026-03-22 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/banach1729/goedel-workbook-lean427
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- lean4
- mathlib
- theorem-proving
- tactic-prediction
- competition-math
- formal-verification
size_categories:
- 10K<n<100K
pretty_name: Goedel Workbook Proofs (Lean 4.27)
dataset_info:
features:
- name: theorem
dtype: string
- name: state
dtype: string
- name: tactic
dtype: string
- name: depth
dtype: int64
- name: source
dtype: string
splits:
- name: train
num_examples: 60341
---
# Goedel Workbook Proofs — Lean 4.27
29,750 competition-math proofs from [Goedel-LM/Lean-workbook-proofs](https://huggingface.co/datasets/Goedel-LM/Lean-workbook-proofs), migrated from **Lean 4.8** to **Lean 4.27.0 / Mathlib v4.27.0**.
The original proofs were generated by [DeepSeek-Prover-V1.5](https://arxiv.org/abs/2408.08152) against the [Lean Workbook](https://huggingface.co/datasets/internlm/Lean-Workbook) problem set.
## Quick Stats
| Metric | Value |
|--------|-------|
| Total proofs | 29,750 |
| Compiling on Lean 4.27 | 28,016 (94.1%) |
| Traced tactic pairs | 60,341 |
| Theorems with traced pairs | 24,879 |
| Unique tactic heads | 73 |
| Median proof depth | 1 |
## Dataset Structure
### Tactic Pairs (`data/tactic_pairs.jsonl`)
The main ML-usable artifact. Each line is a JSON object:
```json
{"theorem": "lean_workbook_10009", "state": "a : ℝ\nb : ℝ\n⊢ a ^ 2 + b ^ 2 ≥ 0", "tactic": "nlinarith [sq_nonneg a, sq_nonneg b]", "depth": 0, "source": "goedel_workbook"}
```
| Field | Description |
|-------|-------------|
| `theorem` | Upstream problem ID from Lean Workbook |
| `state` | Lean 4 tactic state (hypotheses + goal) before the tactic |
| `tactic` | The tactic applied |
| `depth` | Proof tree depth (0 = root goal) |
| `source` | Always `"goedel_workbook"` |
### Lean Proof Files (`GoedelProofs/`)
29,750 individual `.lean` files, each containing a single theorem with its proof. Every file imports Mathlib and Aesop.
### Metadata (`data/`)
- `compile_status.json` — per-proof compilation status (`ok`, `warn`, `error`, `timeout`)
- `integrity_report.json` — sorry/admit/cheat contamination sweep (0 found in 28,016 compiling proofs)
- `manifest.json` — mapping between proof files and upstream problem IDs
## Loading the Data
```python
import json
pairs = [json.loads(line) for line in open("data/tactic_pairs.jsonl")]
print(f"{len(pairs)} tactic pairs from {len(set(p['theorem'] for p in pairs))} theorems")
# Filter to deep proofs
deep = [p for p in pairs if p["depth"] >= 4]
```
Or with HuggingFace datasets:
```python
from datasets import load_dataset
ds = load_dataset("banach1729/goedel-workbook-lean427", data_files="data/tactic_pairs.jsonl", split="train")
```
## Proof Depth Distribution
| Depth | Theorems | % |
|-------|----------|---|
| 0 (single-tactic) | 11,512 | 46.3% |
| 1 | 6,344 | 25.5% |
| 2-3 | 3,735 | 15.0% |
| 4-6 | 2,178 | 8.8% |
| 7+ | 1,110 | 4.5% |
## Top Tactics
| Tactic | % of steps |
|--------|-----------|
| have | 26.9% |
| nlinarith | 17.7% |
| intro | 6.7% |
| field_simp | 5.9% |
| ring_nf | 5.0% |
## Migration
The migration from Lean 4.8 → 4.27 involved Mathlib lemma renames, BigOperators notation changes (`in` → `∈`), and `field_simp` behavioral fixes. 94.1% of proofs compile successfully. See the [GitHub repository](https://github.com/mike1729/goedel-workbook-lean427) for full migration scripts and documentation.
## Known Limitations
- 5.9% of proofs don't compile on Lean 4.27 (included for completeness)
- Machine-generated proofs are shallow: 46% single-tactic, mostly `nlinarith`
- Heavily competition algebra focused (71% inequality goals)
- Tactic pairs extracted via Pantograph replay (no source position metadata)
## Citation
```bibtex
@article{lin2025goedel,
title={Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving},
author={Lin, Yong and Agarwal, Shange and Jiang, Bohan and Lu, Changran and Yuan, Yufei and Liu, Hao and Wang, Tongxuan and Ruan, Zhongjian and Zhang, Qingyu and Zeng, Aimin and others},
journal={arXiv preprint arXiv:2502.07640},
year={2025}
}
@article{xin2024deepseek,
title={DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search},
author={Xin, Huajian and Guo, Daya and Shao, Zhihong and Ren, Zhizhou and Zhu, Qihao and Liu, Bo and Ruan, Chong and Li, Wenda and Liang, Xiaodan},
journal={arXiv preprint arXiv:2408.08152},
year={2024}
}
```
## License
Apache 2.0
提供机构:
banach1729



