xu3kev/proof-pile-2-proofsteps

Name: xu3kev/proof-pile-2-proofsteps
Creator: xu3kev
Published: 2023-10-28 21:45:17
License: 暂无描述

Hugging Face2023-10-28 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/xu3kev/proof-pile-2-proofsteps

下载链接

链接失效反馈

官方服务：

资源简介：

--- configs: - config_name: default data_files: - split: lean_proofsteps path: "lean_proofsteps/*.parquet" - split: isa_proofsteps path: "isa_proofsteps/*.parquet" --- Proofsteps data from Proof-Pile-2 includes proofsteps for Lean and Isabelle ```python from datasets import load_dataset ds = load_dataset( "xu3kev/proof-pile-2-proofsteps" ) ds DatasetDict({ lean_proofsteps: Dataset({ features: ['text', 'meta'], num_rows: 3432 }) isa_proofsteps: Dataset({ features: ['text', 'meta'], num_rows: 260726 }) }) ``` Quoting from appendix of [LLEMMA: AN OPEN LANGUAGE MODEL FOR MATHEMATICS](https://arxiv.org/pdf/2310.10631.pdf) ``` B.1.2 LEAN PROOFSTEPS We extract a dataset of (tactic state, next tactic) pairs from Mathlib 4 (mathlib Community, 2020) using the lean-training-data (Morrison, 2023) tool. We use Mathlib 4 commit c779bd5, which was created on August 20th 2023. B.1.3 ISABELLE PROOFSTEPS We construct a dataset of Isabelle proofs, building upon the PISA dataset Jiang et al. (2021). Isabelle Proofsteps comprises proofs from the Archive of Formal Proofs and Isabelle Standard Library, scraped with PISA Jiang et al. (2021). Each entry in the dataset includes the theorem statement, the proof states and the proof steps, separated by specific tags. To maintain the integrity of evaluations using the PISA test set, we decontaminate Isabelle Proofsteps by removing theorems whose names overlap with those in the PISA test set. Although this approach results in a strict filtering – removing more than 10,000 theorems although there are only 3600 in the PISA test set – we consider it acceptable in order to mitigate data contamination. After filtering, Isabelle Proofsteps contains 251,000 theorems. ```

提供机构：

xu3kev

原始信息汇总

数据集概述

数据集名称

Proof-Pile-2-Proofsteps

数据集来源

从Mathlib 4和Isabelle的证明步骤中提取。

数据集内容

数据集包含两个部分：

lean_proofsteps: 包含3432条记录，每条记录包含text和meta两个特征。
isa_proofsteps: 包含260726条记录，每条记录包含text和meta两个特征。

数据集描述

lean_proofsteps: 从Mathlib 4中提取的(tactic state, next tactic)对，使用lean-training-data工具。
isa_proofsteps: 从Archive of Formal Proofs和Isabelle Standard Library中提取的Isabelle证明步骤，使用PISA工具进行数据清洗，去除了与PISA测试集重叠的定理，最终包含251,000个定理。

数据集加载

python from datasets import load_dataset ds = load_dataset("xu3kev/proof-pile-2-proofsteps")

5,000+

优质数据集

54 个

任务类型

进入经典数据集