uw-math-ai/APRIL

Name: uw-math-ai/APRIL
Creator: uw-math-ai
Published: 2026-02-27 21:57:15
License: 暂无描述

Hugging Face2026-02-27 更新2026-04-05 收录

下载链接：

https://hf-mirror.com/datasets/uw-math-ai/APRIL

下载链接

链接失效反馈

官方服务：

资源简介：

--- language: - en license: apache-2.0 size_categories: - 100K<n<1M task_categories: - text-generation pretty_name: 'APRIL: A Dataset of Lean 4 Proof Errors and Corrections' tags: - lean - formal-verification - theorem-proving - code - math arxiv: 2602.02990 --- # APRIL Dataset [**Paper**](https://huggingface.co/papers/2602.02990) ## Model & Dataset Download | Resource | Description | Link | |---|---|---| | **APRIL Dataset** | 260K Lean proof-repair tuples with compiler diagnostics and explanations | [uw-math-ai/APRIL](https://huggingface.co/datasets/uw-math-ai/APRIL) | | **gAPRIL-w-exp** | Goedel-8B finetuned on APRIL with joint explanation supervision | [uw-math-ai/gAPRIL-w-exp](https://huggingface.co/uw-math-ai/gAPRIL-w-exp) | | **gAPRIL-wo-exp** | Goedel-8B finetuned on APRIL for repair only (no explanations) | [uw-math-ai/gAPRIL-wo-exp](https://huggingface.co/uw-math-ai/gAPRIL-wo-exp) | ## Dataset Description The APRIL dataset contains **258,103 examples** of Lean 4 proof errors and corrections, organized by error type and split. Each example includes a correct proof, an incorrect proof that produces a specific error, and detailed metadata about the error. ### Data Splits | Split | Total Examples | |-------|----------------| | Train | 249,005 (96.5%)| | Val | 9,263 (3.6%) | | Test | 1,835 (0.7%) | ### Error Types The dataset categorizes errors into four types: 1. **THME (Theorem Mutation Error)** - 154,869 examples (60.0%) - Errors produced by changing one theorem to a similar theorem 2. **TME (Tactic Mutation Error)** - 62,129 examples (24.1%) - Errors produced by changing one to three tactics to similar tactics 3. **LME (Line Mutation Error)** - 18,319 examples (7.1%) - Errors produced by having Deepseek rewrite a single line of the proof 4. **MLME (Multi-Line Mutation Error)** - 22,786 examples (8.8%) - Errors Produced by having Deepseek rewrite the last chunk of the proof (less than half of the lines) ### File Organization ``` train/ ├── thme_train.jsonl (148,361 examples, 1.3 GB) ├── tme_train.jsonl (59,667 examples, 265 MB) ├── mlme_train.jsonl (23,879 examples, 81 MB) └── lme_train.jsonl (17,098 examples, 57 MB) val/ ├── thme_val.jsonl (5,415 examples, 49 MB) ├── tme_val.jsonl (2,064 examples, 9.1 MB) ├── mlme_val.jsonl (763 examples, 2.7 MB) └── lme_val.jsonl (1,021 examples, 3.5 MB) test/ ├── thme_test.jsonl (1,093 examples, 9.0 MB) ├── tme_test.jsonl (398 examples, 1.7 MB) ├── mlme_test.jsonl (144 examples, 482 KB) └── lme_test.jsonl (200 examples, 697 KB) ``` ### Data Fields **Common fields (all error types):** - `correct_proof`: The complete, working Lean 4 proof (includes imports and context) - `incorrect_proof`: The modified proof that produces an error - `src_hash`: Hash of the source theorem - `line`: Line number where the error occurs - `col`: Column number where the error occurs - `error`: The Lean compiler error message - `line_at_error`: The specific line of code that caused the error - `state_at_error`: The proof state when the error occurred - `explanation`: Natural language explanation of why the proof fails - `fix_suggestion`: Natural language suggestion for fixing the error - `source`: Data source (e.g., "herald") - `error_type`: One of `lme`, `tme`, `mlme`, or `thme` - `split`: One of `train`, `val`, or `test` - `theorem`: Theorem identifier (may be null) (used for theorem mutations only) - `correct_name`: Name of the correct theorem/lemma (may be null) (used for theorem mutations only) - `incorrect_name`: Name of the incorrect theorem/lemma used (may be null) (used for theorem mutations only) - `correct_formal`: Formal statement of the correct theorem (may be null) (used for theorem mutations only) - `incorrect_formal`: Formal statement of the incorrect theorem (may be null) (used for theorem mutations only) - `correct_informal`: Informal description of the correct theorem (may be null) (used for theorem mutations only) - `incorrect_informal`: Informal description of the incorrect theorem (may be null) (used for theorem mutations only) ### Download Options 1. **Individual files by split and error type** - Download only what you need 2. **Complete archive**: `data_by_split.zip` (273 MB compressed, ~1.8 GB uncompressed) ### Example Entry (LME) ```json { "path": "CategoryTheory.Bicategory.id_whiskerLeft_symm.lean", "correct_proof": "import Mathlib.CategoryTheory.NatIso import Mathlib.CategoryTheory.Bicategory.Basic open CategoryTheory open Bicategory open Category Iso variable {B : Type u} [Bicategory.{w, v} B] {a b c d e : B} theorem theorem_28010 {f g : a ⟶ b} (η : f ⟶ g) : η = (λ_ f).inv ≫ 𝟙 a ◁ η ≫ (λ_ g).hom := by simp", "incorrect_proof": "import Mathlib.CategoryTheory.NatIso import Mathlib.CategoryTheory.Bicategory.Basic open CategoryTheory open Bicategory open Category Iso variable {B : Type u} [Bicategory.{w, v} B] {a b c d e : B} theorem theorem_28010 {f g : a ⟶ b} (η : f ⟶ g) : η = (λ_ f).inv ≫ 𝟙 a ◁ η ≫ (λ_ g).hom := by rw [← Iso.inv_comp_eq, leftUnitor_naturality, Category.assoc]", "theorem": null, "src_hash": "25ba40f63317e3a83081a8d23cd8f635e281bac38cb2c2548dc0bcecd9660e64", "correct_name": null, "incorrect_name": null, "correct_formal": null, "incorrect_formal": null, "correct_informal": null, "incorrect_informal": null, "line": 8, "col": 6, "error": "tactic 'rewrite' failed, did not find instance of the pattern in the target expression ?f = hom ?α ≫ ?g B : Type u inst✝ : Bicategory B a b : B f g : a ⟶ b η : f ⟶ g ⊢ η = (λ_ f).inv ≫ 𝟙 a ◁ η ≫ (λ_ g).hom", "line_at_error": "rw [← Iso.inv_comp_eq, leftUnitor_naturality, Category.assoc]", "state_at_error": "B : Type u inst✝ : Bicategory B a b : B f g : a ⟶ b η : f ⟶ g ⊢ η = (λ_ f).inv ≫ 𝟙 a ◁ η ≫ (λ_ g).hom", "explanation": "The proof fails because the rewrite tactic cannot match the pattern `?f = hom ?α ≫ ?g` in the target expression, as the target is not in the expected form for `Iso.inv_comp_eq` to apply directly.", "fix_suggestion": "Use the `simp` tactic which can automatically handle all necessary rewrites and simplifications for this bicategorical identity.", "source": "herald", "error_type": "lme", "split": "train" } ``` ## Citation If you use this dataset in your research, please cite: ```bibtex @article{april2026, title = {Learning to Repair Lean Proofs from Compiler Feedback}, author = {Wang, Yiran and Chess, Simon and Lee, Daniel and Ge, Siyuan and Mallavarapu, Ajit and Ilin, Vasily}, howpublished = {\url{https://arxiv.org/abs/2602.02990}} journal = {arXiv preprint}, year = {2026}, } ``` ## License This dataset is licensed under the Apache License 2.0. ## Contact For questions or issues, please contact Vasily Ilin (https://vilin97.github.io/).

提供机构：

uw-math-ai

5,000+

优质数据集

54 个

任务类型

进入经典数据集