batianxieshen/APRIL
收藏Hugging Face2026-04-30 更新2026-05-03 收录
下载链接:
https://hf-mirror.com/datasets/batianxieshen/APRIL
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- en
license: apache-2.0
size_categories:
- 100K<n<1M
task_categories:
- text-generation
pretty_name: 'APRIL: A Dataset of Lean 4 Proof Errors and Corrections'
tags:
- lean
- formal-verification
- theorem-proving
- code
- math
arxiv: 2602.02990
---
# APRIL Dataset
[**Paper**](https://huggingface.co/papers/2602.02990)
## Model & Dataset Download
| Resource | Description | Link |
|---|---|---|
| **APRIL Dataset** | 260K Lean proof-repair tuples with compiler diagnostics and explanations | [uw-math-ai/APRIL](https://huggingface.co/datasets/uw-math-ai/APRIL) |
| **gAPRIL-w-exp** | Goedel-8B finetuned on APRIL with joint explanation supervision | [uw-math-ai/gAPRIL-w-exp](https://huggingface.co/uw-math-ai/gAPRIL-w-exp) |
| **gAPRIL-wo-exp** | Goedel-8B finetuned on APRIL for repair only (no explanations) | [uw-math-ai/gAPRIL-wo-exp](https://huggingface.co/uw-math-ai/gAPRIL-wo-exp) |
## Dataset Description
The APRIL dataset contains **258,103 examples** of Lean 4 proof errors and corrections, organized by error type and split. Each example includes a correct proof, an incorrect proof that produces a specific error, and detailed metadata about the error.
### Data Splits
| Split | Total Examples |
|-------|----------------|
| Train | 249,005 (96.5%)|
| Val | 9,263 (3.6%) |
| Test | 1,835 (0.7%) |
### Error Types
The dataset categorizes errors into four types:
1. **THME (Theorem Mutation Error)** - 154,869 examples (60.0%)
- Errors produced by changing one theorem to a similar theorem
2. **TME (Tactic Mutation Error)** - 62,129 examples (24.1%)
- Errors produced by changing one to three tactics to similar tactics
3. **LME (Line Mutation Error)** - 18,319 examples (7.1%)
- Errors produced by having Deepseek rewrite a single line of the proof
4. **MLME (Multi-Line Mutation Error)** - 22,786 examples (8.8%)
- Errors Produced by having Deepseek rewrite the last chunk of the proof (less than half of the lines)
### File Organization
```
train/
├── thme_train.jsonl (148,361 examples, 1.3 GB)
├── tme_train.jsonl (59,667 examples, 265 MB)
├── mlme_train.jsonl (23,879 examples, 81 MB)
└── lme_train.jsonl (17,098 examples, 57 MB)
val/
├── thme_val.jsonl (5,415 examples, 49 MB)
├── tme_val.jsonl (2,064 examples, 9.1 MB)
├── mlme_val.jsonl (763 examples, 2.7 MB)
└── lme_val.jsonl (1,021 examples, 3.5 MB)
test/
├── thme_test.jsonl (1,093 examples, 9.0 MB)
├── tme_test.jsonl (398 examples, 1.7 MB)
├── mlme_test.jsonl (144 examples, 482 KB)
└── lme_test.jsonl (200 examples, 697 KB)
```
### Data Fields
**Common fields (all error types):**
- `correct_proof`: The complete, working Lean 4 proof (includes imports and context)
- `incorrect_proof`: The modified proof that produces an error
- `src_hash`: Hash of the source theorem
- `line`: Line number where the error occurs
- `col`: Column number where the error occurs
- `error`: The Lean compiler error message
- `line_at_error`: The specific line of code that caused the error
- `state_at_error`: The proof state when the error occurred
- `explanation`: Natural language explanation of why the proof fails
- `fix_suggestion`: Natural language suggestion for fixing the error
- `source`: Data source (e.g., "herald")
- `error_type`: One of `lme`, `tme`, `mlme`, or `thme`
- `split`: One of `train`, `val`, or `test`
- `theorem`: Theorem identifier (may be null) (used for theorem mutations only)
- `correct_name`: Name of the correct theorem/lemma (may be null) (used for theorem mutations only)
- `incorrect_name`: Name of the incorrect theorem/lemma used (may be null) (used for theorem mutations only)
- `correct_formal`: Formal statement of the correct theorem (may be null) (used for theorem mutations only)
- `incorrect_formal`: Formal statement of the incorrect theorem (may be null) (used for theorem mutations only)
- `correct_informal`: Informal description of the correct theorem (may be null) (used for theorem mutations only)
- `incorrect_informal`: Informal description of the incorrect theorem (may be null) (used for theorem mutations only)
### Download Options
1. **Individual files by split and error type** - Download only what you need
2. **Complete archive**: `data_by_split.zip` (273 MB compressed, ~1.8 GB uncompressed)
### Example Entry (LME)
```json
{
"path": "CategoryTheory.Bicategory.id_whiskerLeft_symm.lean",
"correct_proof": "import Mathlib.CategoryTheory.NatIso
import Mathlib.CategoryTheory.Bicategory.Basic
open CategoryTheory
open Bicategory
open Category Iso
variable {B : Type u} [Bicategory.{w, v} B] {a b c d e : B}
theorem theorem_28010 {f g : a ⟶ b} (η : f ⟶ g) : η = (λ_ f).inv ≫ 𝟙 a ◁ η ≫ (λ_ g).hom := by
simp",
"incorrect_proof": "import Mathlib.CategoryTheory.NatIso
import Mathlib.CategoryTheory.Bicategory.Basic
open CategoryTheory
open Bicategory
open Category Iso
variable {B : Type u} [Bicategory.{w, v} B] {a b c d e : B}
theorem theorem_28010 {f g : a ⟶ b} (η : f ⟶ g) : η = (λ_ f).inv ≫ 𝟙 a ◁ η ≫ (λ_ g).hom := by
rw [← Iso.inv_comp_eq, leftUnitor_naturality, Category.assoc]",
"theorem": null,
"src_hash": "25ba40f63317e3a83081a8d23cd8f635e281bac38cb2c2548dc0bcecd9660e64",
"correct_name": null,
"incorrect_name": null,
"correct_formal": null,
"incorrect_formal": null,
"correct_informal": null,
"incorrect_informal": null,
"line": 8,
"col": 6,
"error": "tactic 'rewrite' failed, did not find instance of the pattern in the target expression
?f = hom ?α ≫ ?g
B : Type u
inst✝ : Bicategory B
a b : B
f g : a ⟶ b
η : f ⟶ g
⊢ η = (λ_ f).inv ≫ 𝟙 a ◁ η ≫ (λ_ g).hom",
"line_at_error": "rw [← Iso.inv_comp_eq, leftUnitor_naturality, Category.assoc]",
"state_at_error": "B : Type u inst✝ : Bicategory B a b : B f g : a ⟶ b η : f ⟶ g ⊢ η = (λ_ f).inv ≫ 𝟙 a ◁ η ≫ (λ_ g).hom",
"explanation": "The proof fails because the rewrite tactic cannot match the pattern `?f = hom ?α ≫ ?g` in the target expression, as the target is not in the expected form for `Iso.inv_comp_eq` to apply directly.",
"fix_suggestion": "Use the `simp` tactic which can automatically handle all necessary rewrites and simplifications for this bicategorical identity.",
"source": "herald",
"error_type": "lme",
"split": "train"
}
```
## Citation
If you use this dataset in your research, please cite:
```bibtex
@article{april2026,
title = {Learning to Repair Lean Proofs from Compiler Feedback},
author = {Wang, Yiran and Chess, Simon and Lee, Daniel and Ge, Siyuan and Mallavarapu, Ajit and Ilin, Vasily},
howpublished = {\url{https://arxiv.org/abs/2602.02990}}
journal = {arXiv preprint},
year = {2026},
}
```
## License
This dataset is licensed under the Apache License 2.0.
## Contact
For questions or issues, please contact Vasily Ilin (https://vilin97.github.io/).
The APRIL dataset contains 258,103 examples of Lean 4 proof errors and corrections, organized by error type and split. Each example includes a correct proof, an incorrect proof that produces a specific error, and detailed metadata about the error. The dataset is divided into training set (249,005 examples, 96.5%), validation set (9,263 examples, 3.6%), and test set (1,835 examples, 0.7%). Error types are categorized into four types: Theorem Mutation Error (THME, 154,869 examples, 60.0%), Tactic Mutation Error (TME, 62,129 examples, 24.1%), Line Mutation Error (LME, 18,319 examples, 7.1%), and Multi-Line Mutation Error (MLME, 22,786 examples, 8.8%). The dataset files are organized by split and error type, containing multiple JSONL files. Each example includes fields such as correct proof, incorrect proof, error message, explanation, and fix suggestion.
提供机构:
batianxieshen
搜集汇总
数据集介绍

构建方式
APRIL数据集是基于Lean 4形式化证明系统中的编译错误与修复样例构建而成。研究者通过三种策略生成错误证明:对于定理变异错误,通过将正确定理替换为相似定理来诱导错误;对于策略变异错误,修改一至三个策略为相似策略;对于行变异和多行变异错误,则利用Deepseek模型重写单行或末尾部分证明代码。最终汇集得到258,103个修复元组,每个样例均包含正确的完整证明、产生特定编译错误的错误证明以及详细元数据。
特点
该数据集具备鲜明的结构化特征与领域针对性。其涵盖四种差异化的错误类型——定理变异、策略变异、单行变异与多行变异,并依此组织为独立的子集文件。每个示例不仅提供编译器诊断信息,还附有自然语言描述的错误解释与修复建议,将形式化语言错误与人类理解相通约。数据划分上,训练集占比96.5%,验证集与测试集分别占3.6%与0.7%,为模型微调与评估提供了均衡的分布。
使用方法
使用者可依据研究需求灵活下载数据:通过HuggingFace平台获取按错误类型与划分组织的JSONL文件,或下载完整压缩归档。推荐将该数据集用于基于编译器反馈的Lean 4证明修复系统的训练与评测。尤为适用于微调语言模型的任务设置——例如使用gAPRIL-w-exp模型时可将错误证明与编译器诊断作为输入,输出修复后的证明及自然语言解释,实现端到端的自动化证明修复。
背景与挑战
背景概述
APRIL数据集由华盛顿大学数学与人工智能团队于2026年创建,核心研究人员包括Yiran Wang、Simon Chess等,旨在应对形式化定理证明中Lean 4编译器错误难以自动修复的挑战。该数据集包含超过25.8万条证明错误与修正配对样本,覆盖定理突变、策略突变、单行及多行改写四种错误类型,并附有编译器诊断信息与自然语言解释。作为首个大规模Lean 4错误修复数据集,APRIL为训练语言模型自动定位和修正形式化证明中的错误提供了关键资源,推动了人工智能辅助数学证明的自动化进程。
当前挑战
APRIL数据集所解决的领域问题在于形式化定理证明中,编译器错误反馈复杂且类型多样,现有模型难以高效识别并修复证明中的逻辑漏洞。构建过程中面临的主要挑战包括:如何系统化生成高覆盖率的错误样本以保证数据多样性,如何精确提取错误位置、编译状态及修复建议以形成结构化训练目标,以及如何平衡不同错误类型的数据量以避免模型偏倚。此外,Lean 4语言本身的演化性与数学证明的深度嵌套结构也给数据标注与质量维护带来了额外难度。
常用场景
经典使用场景
在形式化验证与定理证明的前沿领域,APRIL数据集为训练和评估基于深度学习的Lean 4证明修复模型提供了规模化、结构化的语料资源。其核心应用场景涵盖错误诊断与自动修正两大任务:研究者可利用其包含的二十五万余条由正确证明、错误证明及编译器诊断信息构成的三元组,训练语言模型在给定错误代码片段和编译错误信息时,准确预测修复策略。该数据集按错误成因细分为定理变异、策略变异、单行改写及多行改写四类,支持细粒度的错误分类与定位研究,为构建端到端的自动证明修复系统奠定了坚实的基准基础。
解决学术问题
APRIL数据集系统性地回应了现有定理证明辅助工具在自动化纠错方面的关键挑战——即缺乏大规模、带有编译器反馈的错误修正配对数据。该资源通过构建四类典型的证明错误变异范式,有效填补了形式化证明自动修复领域高保真训练数据的空白,使得研究者能够将证明修复视为一个结合代码理解与编译器信号的多模态学习问题。其意义在于推动了智能编程助手从单纯的证明生成向鲁棒的故障诊断与自愈能力演进,为提升形式化验证系统的实用性与开发效率开辟了新的路径。
衍生相关工作
围绕APRIL数据集已衍生出若干有影响力的创新工作。一方面,研究者基于该数据微调出了gAPRIL-w-exp与gAPRIL-wo-exp系列模型,验证了在修复监督信号中融入自然语言解释对提升模型泛化性能的有效性。另一方面,该数据集被用于探究编译器错误消息的隐式语义编码方法,启发后续工作如将证明状态编码为结构化图以增强修复模型对局部错误的感知能力。此外,APRIL的错误分类框架为后续合成更复杂多步修复链的数据生成流程提供了范本,推动了形式化证明自动修复领域从单一错误修复向连续纠错模式的演进。
以上内容由遇见数据集搜集并总结生成



