ReactorJet/prooflang

Name: ReactorJet/prooflang
Creator: ReactorJet
Published: 2026-04-06 01:15:52
License: 暂无描述

Hugging Face2026-04-06 更新2026-04-12 收录

下载链接：

https://hf-mirror.com/datasets/ReactorJet/prooflang

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: cc-by-4.0 task_categories: - text-generation language: - en size_categories: - 1B<n<10B pretty_name: ProofLang Corpus dataset_info: - config_name: proofs num_bytes: 3197091800 num_examples: 3681901 features: - name: fileID dtype: string - name: proof dtype: string - config_name: sentences num_bytes: 3736579062 num_examples: 38899130 features: - name: fileID dtype: string - name: sentence dtype: string download_size: 6933683563 dataset_size: 6933670862 --- # Dataset Card for the ProofLang Corpus ## Dataset Summary The ProofLang Corpus includes 3.7M proofs (558 million words) mechanically extracted from papers that were posted on [arXiv.org](https://arXiv.org) between 1992 and 2020. The focus of this corpus is proofs, rather than the explanatory text that surrounds them, and more specifically on the *language* used in such proofs. Specific mathematical content is filtered out, resulting in sentences such as `Let MATH be the restriction of MATH to MATH.` This dataset reflects how people prefer to write (non-formalized) proofs, and is also amenable to statistical analyses and experiments with Natural Language Processing (NLP) techniques. We hope it can serve as an aid in the development of language-based proof assistants and proof checkers for professional and educational purposes. ## Dataset Structure There are multiple TSV versions of the data. Primarily, `proofs` divides up the data proof-by-proof, and `sentences` further divides up the same data sentence-by-sentence. The `raw` dataset is a less-cleaned-up version of `proofs`. More usefully, the `tags` dataset gives arXiv subject tags for each paper ID found in the other data files. * The data in `proofs` (and `raw`) consists of a `paper` ID (identifying where the proof was extracted from), and the `proof` as a string. * The data in `sentences` consists of a `paper` ID, and the `sentence` as a string. * The data in `tags` consists of a `paper` ID, and the arXiv subject tags for that paper as a single comma-separated string. Further metadata about papers can be queried from arXiv.org using the paper ID. In particular, each paper `<id>` in the dataset can be accessed online at the url `https://arxiv.org/abs/<id>` ## Dataset Size * `proofs` is 3,094,779,182 bytes (unzipped) and has 3,681,893 examples. * `sentences` is 3,545,309,822 bytes (unzipped) and has 38,899,132 examples. * `tags` is 7,967,839 bytes (unzipped) and has 328,642 rows. * `raw` is 3,178,997,379 bytes (unzipped) and has 3,681,903 examples. ## Dataset Statistics * The average length of `sentences` is 14.1 words. * The average length of `proofs` is 10.5 sentences. ## Dataset Usage Data can be downloaded as (zipped) TSV files. Accessing the data programmatically from Python is also possible using the `Datasets` library. For example, to print the first 10 proofs: ```python from datasets import load_dataset dataset = load_dataset('proofcheck/prooflang', 'proofs', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['proof']) ``` To look at individual sentences from the proofs, ```python dataset = load_dataset('proofcheck/prooflang', 'proofs', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['sentence']) ``` To get a comma-separated list of arXiv subject tags for each paper, ```python from datasets import load_dataset dataset = load_dataset('proofcheck/prooflang', 'tags', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['tags']) ``` Finally, to look at a version of the proofs with less aggressive cleanup (straight from the LaTeX extraction), ```python dataset = load_dataset('proofcheck/prooflang', 'raw', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['proof']) ``` ### Data Splits There is currently no train/test split; all the data is in `train`. ## Dataset Creation We started with the LaTeX source of 1.6M papers that were submitted to [arXiv.org](https://arXiv.org) between 1992 and April 2022. The proofs were extracted using a Python script simulating parts of LaTeX (including defining and expanding macros). It does no actual typesetting, throws away output not between `\begin{proof}...\end{proof}`, and skips math content. During extraction, * Math-mode formulas (signalled by `$`, `\begin{equation}`, etc.) become `MATH` * `\ref{...}` and variants (`autoref`, `\subref`, etc.) become `REF` * `\cite{...}` and variants (`\Citet`, `\shortciteNP`, etc.) become `CITE` * Words that appear to be proper names become `NAME` * `\item` becomes `CASE:` We then run a cleanup pass on the extracted proofs that includes * Cleaning up common extraction errors (e.g., due to uninterpreted macros) * Replacing more references by `REF`, e.g., `Theorem 2(a)` or `Postulate (*)` * Replacing more citations with `CITE`, e.g., `Page 47 of CITE` * Replacing more proof-case markers with `CASE:`, e.g., `Case (a).` * Fixing a few common misspellings ## Additional Information This dataset is released under the Creative Commons Attribution 4.0 licence. Copyright for the actual proofs remains with the authors of the papers on [arXiv.org](https://arXiv.org), but these simplified snippets are fair use under US copyright law.

许可证：CC BY 4.0 任务类别： - 文本生成（text-generation）语言： - 英语（en）规模类别： - 10亿 < 词元数 < 100亿美观名称：ProofLang语料库（ProofLang Corpus）数据集信息： - 配置名称：proofs 字节数：3197091800 示例数：3681901 特征： - 名称：fileID 数据类型：字符串 - 名称：proof 数据类型：字符串 - 配置名称：sentences 字节数：3736579062 示例数：38899130 特征： - 名称：fileID 数据类型：字符串 - 名称：sentence 数据类型：字符串下载大小：6933683563 数据集大小：6933670862 # ProofLang语料库（ProofLang Corpus）数据集卡片 ## 数据集概览 ProofLang语料库包含370万条证明文本（共计5.58亿词），这些证明均为机械提取自1992年至2020年间发布于arXiv.org的学术论文。该语料库的核心聚焦对象为证明文本本身，而非其周边的解释性内容，更具体地说，聚焦于此类证明中所使用的**语言**。特定的数学内容已被过滤，最终得到形如`Let MATH be the restriction of MATH to MATH.`的语句，即`令MATH为MATH在MATH上的限制`。本数据集反映了人们撰写（非形式化）证明的习惯，同时也可用于基于自然语言处理（Natural Language Processing, NLP）技术的统计分析与实验。我们期望该数据集能够助力面向专业与教育场景的、基于语言的证明助手与证明检查器的开发工作。 ## 数据集结构该数据集包含多种制表符分隔值（Tab-Separated Values, TSV）格式版本。其中，`proofs`配置以单条证明为单位拆分数据，`sentences`配置则进一步将数据按语句粒度拆分。`raw`配置是`proofs`配置的未充分清洗版本。更具实用价值的是，`tags`配置可为其他数据文件中的每一个论文ID提供arXiv学科标签。 * `proofs`（及`raw`）配置中的数据包含`paper` ID（用于标识证明的来源论文）与作为字符串的`proof`证明文本。 * `sentences`配置中的数据包含`paper` ID与作为字符串的`sentence`语句文本。 * `tags`配置中的数据包含`paper` ID与以逗号分隔的该论文的arXiv学科标签字符串。可通过论文ID从arXiv.org查询该论文的更多元数据。具体而言，数据集中的每一篇论文`<id>`均可通过网址`https://arxiv.org/abs/<id>`在线访问。 ## 数据集规模 * `proofs`配置解压后大小为3094779182字节，包含3681893条示例。 * `sentences`配置解压后大小为3545309822字节，包含38899132条示例。 * `tags`配置解压后大小为7967839字节，包含328642条记录。 * `raw`配置解压后大小为3178997379字节，包含3681903条示例。 ## 数据集统计信息 * `sentences`配置中语句的平均长度为14.1个词。 * `proofs`配置中证明文本的平均长度为10.5条语句。 ## 数据集使用方式数据集可作为（压缩）TSV文件下载。也可通过`Datasets`库以编程方式从Python中访问该数据集。例如，若要打印前10条证明文本： python from datasets import load_dataset dataset = load_dataset('proofcheck/prooflang', 'proofs', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['proof']) 若要查看证明中的单条语句： python dataset = load_dataset('proofcheck/prooflang', 'proofs', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['sentence']) 若要获取每篇论文的逗号分隔格式arXiv学科标签列表： python from datasets import load_dataset dataset = load_dataset('proofcheck/prooflang', 'tags', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['tags']) 最后，若要查看清洗程度较低的证明文本版本（直接来自LaTeX提取结果）： python from datasets import load_dataset dataset = load_dataset('proofcheck/prooflang', 'raw', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['proof']) ### 数据划分目前尚未划分训练集与测试集，所有数据均位于`train`拆分中。 ## 数据集构建过程我们以1992年至2022年4月间提交至arXiv.org的160万篇论文的LaTeX源代码作为初始数据源。我们通过模拟LaTeX部分功能的Python脚本提取证明文本（包括宏的定义与展开）。该脚本不执行实际的排版操作，仅保留`egin{proof}...end{proof}`环境内的内容，并跳过数学内容。提取过程中： * 以`$`、`egin{equation}`等标记的数学模式公式将被替换为`MATH`。 * ` ef{...}`及其变体（如`autoref`、`subref`等）将被替换为`REF`。 * `cite{...}`及其变体（如`Citet`、`shortciteNP`等）将被替换为`CITE`。 * 疑似专有名词的词汇将被替换为`NAME`。 * `item`将被替换为`CASE:`。随后，我们对提取出的证明文本执行清洗操作，具体包括： * 修复常见的提取错误（如因未解析宏导致的问题）。 * 将更多引用替换为`REF`，例如`Theorem 2(a)`或`Postulate (*)`。 * 将更多引用文献标记替换为`CITE`，例如`Page 47 of CITE`。 * 将更多证明分情况标记替换为`CASE:`，例如`Case (a).`。 * 修正若干常见拼写错误。 ## 附加信息本数据集采用知识共享署名4.0（Creative Commons Attribution 4.0）许可协议发布。证明文本本身的版权归arXiv.org上论文的作者所有，但这些简化后的片段属于美国版权法规定的合理使用范围。

提供机构：

ReactorJet

5,000+

优质数据集

54 个

任务类型

进入经典数据集