Monogate/petal-eml
收藏Hugging Face2026-04-27 更新2026-05-03 收录
下载链接:
https://hf-mirror.com/datasets/Monogate/petal-eml
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-4.0
task_categories:
- text-generation
- text2text-generation
language:
- en
tags:
- lean4
- theorem-proving
- mathematics
- ai-training
- structured
size_categories:
- n<1K
pretty_name: PETAL-EML — Pedagogically-structured Lean 4 Proof Dataset (Seed v1)
---
# PETAL-EML — Seed v2 (35 records)
**P**roof–**E**xplanation–**T**actic **A**ligned **L**ean dataset, seed
release. Sourced from the
[monogate](https://github.com/agent-maestro/monogate) Lean 4 formalization
of the EML (exp-minus-log) operator from
[arXiv:2603.21852](https://arxiv.org/abs/2603.21852).
This is the **seed release v2**: 35 hand-audited theorems spanning 6
pedagogical lanes and all 14 Lean files in `MonogateEML/`, with every
proof step grounded in the actual Lean source (no fabricated tactics,
no `TO_BE_EXTRACTED` placeholders). The full Lean codebase contains
50 EML-original theorems plus Mathlib wrappers; this seed surfaces a
pedagogically structured path through 35 of them, including the four
SuperBEST lower-bound headlines (`SB_add_ge_two`, `SB_sub_ge_two`,
`SB_mul_ge_two` precursor via F-family, `SB_div_ge_three_full`
scaffold), the seven 1-node positive-domain upper bounds, the
self-map conjugacies, two Mathlib wrappers from `Gamma.lean`, and
the informational record at the InfiniteZerosBarrier research
frontier.
History: v1 (2026-04-26 NIGHT) shipped 15 records covering 7 of 14
Lean files; v2 (2026-04-27) expanded to 35 records covering all 14
files. A future release may add `goal_before` / `goal_after` Lean
output strings extracted via interactive Lean execution — those
fields are deliberately omitted in this seed rather than fabricated.
---
## Why this exists
LLMs trained on Lean 4 code see the proof and the goal but rarely
the *teacher's voice* explaining why a particular tactic was
chosen, what failed alternatives look like, and what a beginner
typically gets wrong on the first attempt. PETAL surfaces all three
in a structured, machine-readable format aligned to the theorem
statement.
Each record carries:
- **Theorem statement** in three forms: natural language, LaTeX, and Lean 4 source
- **Proof** — full Lean 4 source plus a per-step breakdown
- **Per-step explanation** — what the tactic does, why this approach works, common mistakes
- **Lane assignment** — pedagogical difficulty bucket (1–6)
- **Dependency graph** — which earlier theorems this one builds on
- **Source pointer** — exact .lean file path + line number for every claim
The dataset is not a benchmark. It is teaching material structured for
language models that need to learn to reason about Lean proofs.
---
## Lane structure (6 lanes)
Lanes are pedagogical difficulty buckets. They are monotonically
non-decreasing in difficulty: a learner who has cleared Lane k can
attempt Lane k+1.
| Lane | Title | Focus | Lean tactics introduced |
|------|-------|-------|--------------------------|
| 1 | First Contact | Definitions, `rfl`, term-mode | `rfl`, term-proof |
| 2 | Building Blocks | Composition, `simp`, `linarith` | `simp [..]`, `linarith` |
| 3 | Arithmetic from Nothing | Witness extraction, F-family lower bounds | `intro`, `have`, `simp only [..]`, `linarith` |
| 4 | The Other Operators | SuperBEST, multi-operator lower bounds | `obtain ⟨..⟩`, `exact`, structural elimination |
| 5 | Into the Complex Plane | Universality of EML, `IsEMLElementary`, composition closure | `obtain`, `refine`, `IsEMLElementary.comp` |
| 6 | The Barrier | Research frontier — InfiniteZerosBarrier, open o-minimal questions | (overview only; 2 sorries documented) |
Lane 6 is informational: it points at the open problems
documented in the codebase (the 2 remaining `sorry`s in
`InfiniteZerosBarrier.lean`) so a learner sees where the research
frontier is, not just a finite curriculum.
---
## Record schema
Every record in `data/petal_eml.json` conforms to this shape:
```json
{
"theorem_id": "string — globally unique across the dataset",
"lane": "int 1..6",
"lane_title": "string — matches the Lane table above",
"difficulty": "int 1..6 — monotone with lane",
"domain": "string — math area (algebra / analysis / hyperbolic / topology / ...)",
"statement": {
"natural_language": "string — one paragraph, plain English",
"latex": "string — KaTeX-renderable",
"lean4": "string — the literal Lean 4 statement, copied from source"
},
"proof": {
"lean4_full": "string — the literal Lean 4 proof block",
"lean4_step_by_step": [
{
"step": "int — 1-indexed",
"tactic": "string — the tactic invocation",
"explanation": "string — natural language: what this step does",
"tactic_rationale": "string — why this tactic was chosen here",
"common_mistakes": ["string — example of what beginners try that fails"]
}
]
},
"dependencies": ["string — theorem_ids this proof relies on (may be empty)"],
"unlocks": ["string — theorem_ids this proof enables"],
"source": {
"file": "string — path within the monogate-lean repo",
"line": "int — 1-indexed line of the theorem keyword"
},
"tags": ["string — searchable labels"],
"metadata": {
"tactics_introduced": ["string — Lean tactic names new in this record"],
"cost_class": "string|null — eml-cost Pfaffian class if applicable"
}
}
```
**Honest note on `lean4_step_by_step`:** Not every Lean tactic invocation
maps to a single human-readable "step." Where a one-line proof uses
two combined tactics (e.g. `simp [..]; linarith`), we represent each
as its own step with a shared `goal_before`-equivalent explanation.
We do NOT include `goal_before` / `goal_after` Lean output strings in
this seed release because those require interactive Lean execution to
extract reliably; including them as "TO_BE_EXTRACTED" would be
dishonest. A future release with VS Code-extracted goal states is
planned.
---
## Validation
The seed dataset ships with a Python validator
(`validate_petal.py`). It checks:
- All required fields present on every record
- `theorem_id`s are unique
- Dependency graph is acyclic
- Every dependency references a theorem_id present in the dataset
- Lane assignments are monotone with `difficulty`
- Source pointers reference real `.lean` files (path resolves)
- No standing-rules-forbidden words appear in any prose field
("first," "novel," "proven" outside Lean code blocks, "always,"
"guaranteed," "best")
Run:
```bash
python validate_petal.py data/petal_eml.json
```
A passing validator is the entry condition for any HF push.
---
## Statistics (seed v2)
| metric | value |
|---|---|
| Total records | 35 |
| Lanes covered | 6 (Lane 1: 4, Lane 2: 4, Lane 3: 6, Lane 4: 5, Lane 5: 15, Lane 6: 1) |
| Files referenced | **14 of 14** (full `MonogateEML/` coverage) |
| Lean tactics introduced (cumulative) | `rfl`, `simp [..]`, `simp only [..]`, `linarith`, `intro`, `have`, `obtain ⟨..⟩`, `exact`, `rw [..]`, `ring`, `ring_nf`, `norm_num`, `field_simp`, `unfold`, `by_contra`, `push_neg`, `trivial`, term-mode, anonymous-constructor packing |
| Total proof steps | 54 |
| Common-mistake notes | 60 |
| Mathlib wrappers as records | 3 (`complex_exp_zero_eq_one`, `gamma_one`, `gamma_nat_eq_factorial`) — included to teach the term-mode citation pattern |
| Synthesis bundles | 1 (`superbest_positive_one_node_ops` — packs 7 named theorems into one conjunction) |
| Scaffold records | 1 (`SB_div_ge_three_full` — Lean placeholder accepting an external Python certification) |
| Open-problem records (Lane 6) | 1 (informational — `infinite_zeros_barrier__research_frontier`) |
| Load-bearing-for-public-claim records | 2 (`superbest_v53_positive_total` and `superbest_v53_savings_fraction` underwrite the public 14n / 80.8% headline) |
Counts above are auto-computed by the validator on each run.
---
## Source provenance
All theorems are extracted verbatim from the
[monogate-lean](https://github.com/agent-maestro/monogate-lean) repo
at the commit referenced in each record's `source.file` pointer.
Re-running the validator against an updated codebase will catch
drift.
The codebase compiles under Lean 4 + Mathlib (lakefile in the repo
root). The seed records target theorems that build cleanly with `lake
build` at the time of writing. No `sorry` is in the proof of any
shipped record — the only mention of `sorry` in this dataset is the
Lane 6 *informational* record about the research frontier.
---
## License
CC BY 4.0 — same as the underlying arXiv paper. Attribution: cite the
arXiv paper plus this dataset card.
---
## Citation
```bibtex
@dataset{petal_eml_seed_v1,
title = {PETAL-EML — Pedagogically-structured Lean 4 Proof Dataset (Seed v1)},
author = {Monogate Research},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/monogate/petal-eml},
note = {Sourced from the Lean 4 formalization accompanying arXiv:2603.21852},
license = {CC BY 4.0}
}
@article{odrzywolek2026eml,
title = {All elementary functions from a single operator},
author = {Odrzywo{\l}ek, Andrzej},
year = {2026},
journal = {arXiv preprint},
eprint = {2603.21852}
}
```
---
## What this dataset is *not*
- **Not a benchmark.** No held-out split, no leaderboard. It is teaching material.
- **Not exhaustive.** 15 of ~50 EML-original theorems. Future v2 covers more.
- **Not goal-state-extracted.** Per-step `goal_before` / `goal_after` strings require interactive Lean execution and are deliberately omitted in this seed release rather than fabricated.
- **Not a replacement for the Lean source.** Always verify against the actual `.lean` files via the `source` pointer on every record.
---
## Status flags (seed v1)
- **Schema:** stable for v1; breaking changes will require a v2 dataset under a new tag
- **Records:** 35, hand-audited
- **Validator:** passing on this seed
- **HF publication:** pending (token required)
- **monogate.dev/learn integration:** pending (Lane 1 first)
提供机构:
Monogate



