five

Monogate/petal-eml

收藏
Hugging Face2026-04-27 更新2026-05-03 收录
下载链接:
https://hf-mirror.com/datasets/Monogate/petal-eml
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: cc-by-4.0 task_categories: - text-generation - text2text-generation language: - en tags: - lean4 - theorem-proving - mathematics - ai-training - structured size_categories: - n<1K pretty_name: PETAL-EML — Pedagogically-structured Lean 4 Proof Dataset (Seed v1) --- # PETAL-EML — Seed v2 (35 records) **P**roof–**E**xplanation–**T**actic **A**ligned **L**ean dataset, seed release. Sourced from the [monogate](https://github.com/agent-maestro/monogate) Lean 4 formalization of the EML (exp-minus-log) operator from [arXiv:2603.21852](https://arxiv.org/abs/2603.21852). This is the **seed release v2**: 35 hand-audited theorems spanning 6 pedagogical lanes and all 14 Lean files in `MonogateEML/`, with every proof step grounded in the actual Lean source (no fabricated tactics, no `TO_BE_EXTRACTED` placeholders). The full Lean codebase contains 50 EML-original theorems plus Mathlib wrappers; this seed surfaces a pedagogically structured path through 35 of them, including the four SuperBEST lower-bound headlines (`SB_add_ge_two`, `SB_sub_ge_two`, `SB_mul_ge_two` precursor via F-family, `SB_div_ge_three_full` scaffold), the seven 1-node positive-domain upper bounds, the self-map conjugacies, two Mathlib wrappers from `Gamma.lean`, and the informational record at the InfiniteZerosBarrier research frontier. History: v1 (2026-04-26 NIGHT) shipped 15 records covering 7 of 14 Lean files; v2 (2026-04-27) expanded to 35 records covering all 14 files. A future release may add `goal_before` / `goal_after` Lean output strings extracted via interactive Lean execution — those fields are deliberately omitted in this seed rather than fabricated. --- ## Why this exists LLMs trained on Lean 4 code see the proof and the goal but rarely the *teacher's voice* explaining why a particular tactic was chosen, what failed alternatives look like, and what a beginner typically gets wrong on the first attempt. PETAL surfaces all three in a structured, machine-readable format aligned to the theorem statement. Each record carries: - **Theorem statement** in three forms: natural language, LaTeX, and Lean 4 source - **Proof** — full Lean 4 source plus a per-step breakdown - **Per-step explanation** — what the tactic does, why this approach works, common mistakes - **Lane assignment** — pedagogical difficulty bucket (1–6) - **Dependency graph** — which earlier theorems this one builds on - **Source pointer** — exact .lean file path + line number for every claim The dataset is not a benchmark. It is teaching material structured for language models that need to learn to reason about Lean proofs. --- ## Lane structure (6 lanes) Lanes are pedagogical difficulty buckets. They are monotonically non-decreasing in difficulty: a learner who has cleared Lane k can attempt Lane k+1. | Lane | Title | Focus | Lean tactics introduced | |------|-------|-------|--------------------------| | 1 | First Contact | Definitions, `rfl`, term-mode | `rfl`, term-proof | | 2 | Building Blocks | Composition, `simp`, `linarith` | `simp [..]`, `linarith` | | 3 | Arithmetic from Nothing | Witness extraction, F-family lower bounds | `intro`, `have`, `simp only [..]`, `linarith` | | 4 | The Other Operators | SuperBEST, multi-operator lower bounds | `obtain ⟨..⟩`, `exact`, structural elimination | | 5 | Into the Complex Plane | Universality of EML, `IsEMLElementary`, composition closure | `obtain`, `refine`, `IsEMLElementary.comp` | | 6 | The Barrier | Research frontier — InfiniteZerosBarrier, open o-minimal questions | (overview only; 2 sorries documented) | Lane 6 is informational: it points at the open problems documented in the codebase (the 2 remaining `sorry`s in `InfiniteZerosBarrier.lean`) so a learner sees where the research frontier is, not just a finite curriculum. --- ## Record schema Every record in `data/petal_eml.json` conforms to this shape: ```json { "theorem_id": "string — globally unique across the dataset", "lane": "int 1..6", "lane_title": "string — matches the Lane table above", "difficulty": "int 1..6 — monotone with lane", "domain": "string — math area (algebra / analysis / hyperbolic / topology / ...)", "statement": { "natural_language": "string — one paragraph, plain English", "latex": "string — KaTeX-renderable", "lean4": "string — the literal Lean 4 statement, copied from source" }, "proof": { "lean4_full": "string — the literal Lean 4 proof block", "lean4_step_by_step": [ { "step": "int — 1-indexed", "tactic": "string — the tactic invocation", "explanation": "string — natural language: what this step does", "tactic_rationale": "string — why this tactic was chosen here", "common_mistakes": ["string — example of what beginners try that fails"] } ] }, "dependencies": ["string — theorem_ids this proof relies on (may be empty)"], "unlocks": ["string — theorem_ids this proof enables"], "source": { "file": "string — path within the monogate-lean repo", "line": "int — 1-indexed line of the theorem keyword" }, "tags": ["string — searchable labels"], "metadata": { "tactics_introduced": ["string — Lean tactic names new in this record"], "cost_class": "string|null — eml-cost Pfaffian class if applicable" } } ``` **Honest note on `lean4_step_by_step`:** Not every Lean tactic invocation maps to a single human-readable "step." Where a one-line proof uses two combined tactics (e.g. `simp [..]; linarith`), we represent each as its own step with a shared `goal_before`-equivalent explanation. We do NOT include `goal_before` / `goal_after` Lean output strings in this seed release because those require interactive Lean execution to extract reliably; including them as "TO_BE_EXTRACTED" would be dishonest. A future release with VS Code-extracted goal states is planned. --- ## Validation The seed dataset ships with a Python validator (`validate_petal.py`). It checks: - All required fields present on every record - `theorem_id`s are unique - Dependency graph is acyclic - Every dependency references a theorem_id present in the dataset - Lane assignments are monotone with `difficulty` - Source pointers reference real `.lean` files (path resolves) - No standing-rules-forbidden words appear in any prose field ("first," "novel," "proven" outside Lean code blocks, "always," "guaranteed," "best") Run: ```bash python validate_petal.py data/petal_eml.json ``` A passing validator is the entry condition for any HF push. --- ## Statistics (seed v2) | metric | value | |---|---| | Total records | 35 | | Lanes covered | 6 (Lane 1: 4, Lane 2: 4, Lane 3: 6, Lane 4: 5, Lane 5: 15, Lane 6: 1) | | Files referenced | **14 of 14** (full `MonogateEML/` coverage) | | Lean tactics introduced (cumulative) | `rfl`, `simp [..]`, `simp only [..]`, `linarith`, `intro`, `have`, `obtain ⟨..⟩`, `exact`, `rw [..]`, `ring`, `ring_nf`, `norm_num`, `field_simp`, `unfold`, `by_contra`, `push_neg`, `trivial`, term-mode, anonymous-constructor packing | | Total proof steps | 54 | | Common-mistake notes | 60 | | Mathlib wrappers as records | 3 (`complex_exp_zero_eq_one`, `gamma_one`, `gamma_nat_eq_factorial`) — included to teach the term-mode citation pattern | | Synthesis bundles | 1 (`superbest_positive_one_node_ops` — packs 7 named theorems into one conjunction) | | Scaffold records | 1 (`SB_div_ge_three_full` — Lean placeholder accepting an external Python certification) | | Open-problem records (Lane 6) | 1 (informational — `infinite_zeros_barrier__research_frontier`) | | Load-bearing-for-public-claim records | 2 (`superbest_v53_positive_total` and `superbest_v53_savings_fraction` underwrite the public 14n / 80.8% headline) | Counts above are auto-computed by the validator on each run. --- ## Source provenance All theorems are extracted verbatim from the [monogate-lean](https://github.com/agent-maestro/monogate-lean) repo at the commit referenced in each record's `source.file` pointer. Re-running the validator against an updated codebase will catch drift. The codebase compiles under Lean 4 + Mathlib (lakefile in the repo root). The seed records target theorems that build cleanly with `lake build` at the time of writing. No `sorry` is in the proof of any shipped record — the only mention of `sorry` in this dataset is the Lane 6 *informational* record about the research frontier. --- ## License CC BY 4.0 — same as the underlying arXiv paper. Attribution: cite the arXiv paper plus this dataset card. --- ## Citation ```bibtex @dataset{petal_eml_seed_v1, title = {PETAL-EML — Pedagogically-structured Lean 4 Proof Dataset (Seed v1)}, author = {Monogate Research}, year = {2026}, publisher = {Hugging Face}, url = {https://huggingface.co/datasets/monogate/petal-eml}, note = {Sourced from the Lean 4 formalization accompanying arXiv:2603.21852}, license = {CC BY 4.0} } @article{odrzywolek2026eml, title = {All elementary functions from a single operator}, author = {Odrzywo{\l}ek, Andrzej}, year = {2026}, journal = {arXiv preprint}, eprint = {2603.21852} } ``` --- ## What this dataset is *not* - **Not a benchmark.** No held-out split, no leaderboard. It is teaching material. - **Not exhaustive.** 15 of ~50 EML-original theorems. Future v2 covers more. - **Not goal-state-extracted.** Per-step `goal_before` / `goal_after` strings require interactive Lean execution and are deliberately omitted in this seed release rather than fabricated. - **Not a replacement for the Lean source.** Always verify against the actual `.lean` files via the `source` pointer on every record. --- ## Status flags (seed v1) - **Schema:** stable for v1; breaking changes will require a v2 dataset under a new tag - **Records:** 35, hand-audited - **Validator:** passing on this seed - **HF publication:** pending (token required) - **monogate.dev/learn integration:** pending (Lane 1 first)
提供机构:
Monogate
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作