xu3kev/proof-pile-2-proofsteps
收藏Hugging Face2023-10-28 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/xu3kev/proof-pile-2-proofsteps
下载链接
链接失效反馈官方服务:
资源简介:
---
configs:
- config_name: default
data_files:
- split: lean_proofsteps
path: "lean_proofsteps/*.parquet"
- split: isa_proofsteps
path: "isa_proofsteps/*.parquet"
---
Proofsteps data from Proof-Pile-2
includes proofsteps for Lean and Isabelle
```python
from datasets import load_dataset
ds = load_dataset(
"xu3kev/proof-pile-2-proofsteps"
)
ds
DatasetDict({
lean_proofsteps: Dataset({
features: ['text', 'meta'],
num_rows: 3432
})
isa_proofsteps: Dataset({
features: ['text', 'meta'],
num_rows: 260726
})
})
```
Quoting from appendix of [LLEMMA: AN OPEN LANGUAGE MODEL FOR MATHEMATICS](https://arxiv.org/pdf/2310.10631.pdf)
```
B.1.2 LEAN PROOFSTEPS
We extract a dataset of (tactic state, next tactic) pairs from Mathlib 4 (mathlib Community, 2020)
using the lean-training-data (Morrison, 2023) tool. We use Mathlib 4 commit c779bd5,
which was created on August 20th 2023.
B.1.3 ISABELLE PROOFSTEPS
We construct a dataset of Isabelle proofs, building upon the PISA dataset Jiang et al. (2021). Isabelle
Proofsteps comprises proofs from the Archive of Formal Proofs and Isabelle Standard Library, scraped
with PISA Jiang et al. (2021). Each entry in the dataset includes the theorem statement, the proof
states and the proof steps, separated by specific tags. To maintain the integrity of evaluations using
the PISA test set, we decontaminate Isabelle Proofsteps by removing theorems whose names overlap
with those in the PISA test set. Although this approach results in a strict filtering – removing more
than 10,000 theorems although there are only 3600 in the PISA test set – we consider it acceptable in
order to mitigate data contamination. After filtering, Isabelle Proofsteps contains 251,000 theorems.
```
提供机构:
xu3kev
原始信息汇总
数据集概述
数据集名称
Proof-Pile-2-Proofsteps
数据集来源
从Mathlib 4和Isabelle的证明步骤中提取。
数据集内容
数据集包含两个部分:
- lean_proofsteps: 包含3432条记录,每条记录包含text和meta两个特征。
- isa_proofsteps: 包含260726条记录,每条记录包含text和meta两个特征。
数据集描述
- lean_proofsteps: 从Mathlib 4中提取的(tactic state, next tactic)对,使用lean-training-data工具。
- isa_proofsteps: 从Archive of Formal Proofs和Isabelle Standard Library中提取的Isabelle证明步骤,使用PISA工具进行数据清洗,去除了与PISA测试集重叠的定理,最终包含251,000个定理。
数据集加载
python from datasets import load_dataset ds = load_dataset("xu3kev/proof-pile-2-proofsteps")



