five

ReactorJet/prooflang

收藏
Hugging Face2026-04-06 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/ReactorJet/prooflang
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: cc-by-4.0 task_categories: - text-generation language: - en size_categories: - 1B<n<10B pretty_name: ProofLang Corpus dataset_info: - config_name: proofs num_bytes: 3197091800 num_examples: 3681901 features: - name: fileID dtype: string - name: proof dtype: string - config_name: sentences num_bytes: 3736579062 num_examples: 38899130 features: - name: fileID dtype: string - name: sentence dtype: string download_size: 6933683563 dataset_size: 6933670862 --- # Dataset Card for the ProofLang Corpus ## Dataset Summary The ProofLang Corpus includes 3.7M proofs (558 million words) mechanically extracted from papers that were posted on [arXiv.org](https://arXiv.org) between 1992 and 2020. The focus of this corpus is proofs, rather than the explanatory text that surrounds them, and more specifically on the *language* used in such proofs. Specific mathematical content is filtered out, resulting in sentences such as `Let MATH be the restriction of MATH to MATH.` This dataset reflects how people prefer to write (non-formalized) proofs, and is also amenable to statistical analyses and experiments with Natural Language Processing (NLP) techniques. We hope it can serve as an aid in the development of language-based proof assistants and proof checkers for professional and educational purposes. ## Dataset Structure There are multiple TSV versions of the data. Primarily, `proofs` divides up the data proof-by-proof, and `sentences` further divides up the same data sentence-by-sentence. The `raw` dataset is a less-cleaned-up version of `proofs`. More usefully, the `tags` dataset gives arXiv subject tags for each paper ID found in the other data files. * The data in `proofs` (and `raw`) consists of a `paper` ID (identifying where the proof was extracted from), and the `proof` as a string. * The data in `sentences` consists of a `paper` ID, and the `sentence` as a string. * The data in `tags` consists of a `paper` ID, and the arXiv subject tags for that paper as a single comma-separated string. Further metadata about papers can be queried from arXiv.org using the paper ID. In particular, each paper `<id>` in the dataset can be accessed online at the url `https://arxiv.org/abs/<id>` ## Dataset Size * `proofs` is 3,094,779,182 bytes (unzipped) and has 3,681,893 examples. * `sentences` is 3,545,309,822 bytes (unzipped) and has 38,899,132 examples. * `tags` is 7,967,839 bytes (unzipped) and has 328,642 rows. * `raw` is 3,178,997,379 bytes (unzipped) and has 3,681,903 examples. ## Dataset Statistics * The average length of `sentences` is 14.1 words. * The average length of `proofs` is 10.5 sentences. ## Dataset Usage Data can be downloaded as (zipped) TSV files. Accessing the data programmatically from Python is also possible using the `Datasets` library. For example, to print the first 10 proofs: ```python from datasets import load_dataset dataset = load_dataset('proofcheck/prooflang', 'proofs', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['proof']) ``` To look at individual sentences from the proofs, ```python dataset = load_dataset('proofcheck/prooflang', 'proofs', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['sentence']) ``` To get a comma-separated list of arXiv subject tags for each paper, ```python from datasets import load_dataset dataset = load_dataset('proofcheck/prooflang', 'tags', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['tags']) ``` Finally, to look at a version of the proofs with less aggressive cleanup (straight from the LaTeX extraction), ```python dataset = load_dataset('proofcheck/prooflang', 'raw', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['proof']) ``` ### Data Splits There is currently no train/test split; all the data is in `train`. ## Dataset Creation We started with the LaTeX source of 1.6M papers that were submitted to [arXiv.org](https://arXiv.org) between 1992 and April 2022. The proofs were extracted using a Python script simulating parts of LaTeX (including defining and expanding macros). It does no actual typesetting, throws away output not between `\begin{proof}...\end{proof}`, and skips math content. During extraction, * Math-mode formulas (signalled by `$`, `\begin{equation}`, etc.) become `MATH` * `\ref{...}` and variants (`autoref`, `\subref`, etc.) become `REF` * `\cite{...}` and variants (`\Citet`, `\shortciteNP`, etc.) become `CITE` * Words that appear to be proper names become `NAME` * `\item` becomes `CASE:` We then run a cleanup pass on the extracted proofs that includes * Cleaning up common extraction errors (e.g., due to uninterpreted macros) * Replacing more references by `REF`, e.g., `Theorem 2(a)` or `Postulate (*)` * Replacing more citations with `CITE`, e.g., `Page 47 of CITE` * Replacing more proof-case markers with `CASE:`, e.g., `Case (a).` * Fixing a few common misspellings ## Additional Information This dataset is released under the Creative Commons Attribution 4.0 licence. Copyright for the actual proofs remains with the authors of the papers on [arXiv.org](https://arXiv.org), but these simplified snippets are fair use under US copyright law.

许可证:CC BY 4.0 任务类别: - 文本生成(text-generation) 语言: - 英语(en) 规模类别: - 10亿 < 词元数 < 100亿 美观名称:ProofLang语料库(ProofLang Corpus) 数据集信息: - 配置名称:proofs 字节数:3197091800 示例数:3681901 特征: - 名称:fileID 数据类型:字符串 - 名称:proof 数据类型:字符串 - 配置名称:sentences 字节数:3736579062 示例数:38899130 特征: - 名称:fileID 数据类型:字符串 - 名称:sentence 数据类型:字符串 下载大小:6933683563 数据集大小:6933670862 # ProofLang语料库(ProofLang Corpus)数据集卡片 ## 数据集概览 ProofLang语料库包含370万条证明文本(共计5.58亿词),这些证明均为机械提取自1992年至2020年间发布于arXiv.org的学术论文。 该语料库的核心聚焦对象为证明文本本身,而非其周边的解释性内容,更具体地说,聚焦于此类证明中所使用的**语言**。 特定的数学内容已被过滤,最终得到形如`Let MATH be the restriction of MATH to MATH.`的语句,即`令MATH为MATH在MATH上的限制`。 本数据集反映了人们撰写(非形式化)证明的习惯,同时也可用于基于自然语言处理(Natural Language Processing, NLP)技术的统计分析与实验。我们期望该数据集能够助力面向专业与教育场景的、基于语言的证明助手与证明检查器的开发工作。 ## 数据集结构 该数据集包含多种制表符分隔值(Tab-Separated Values, TSV)格式版本。其中,`proofs`配置以单条证明为单位拆分数据,`sentences`配置则进一步将数据按语句粒度拆分。`raw`配置是`proofs`配置的未充分清洗版本。更具实用价值的是,`tags`配置可为其他数据文件中的每一个论文ID提供arXiv学科标签。 * `proofs`(及`raw`)配置中的数据包含`paper` ID(用于标识证明的来源论文)与作为字符串的`proof`证明文本。 * `sentences`配置中的数据包含`paper` ID与作为字符串的`sentence`语句文本。 * `tags`配置中的数据包含`paper` ID与以逗号分隔的该论文的arXiv学科标签字符串。 可通过论文ID从arXiv.org查询该论文的更多元数据。具体而言,数据集中的每一篇论文`<id>`均可通过网址`https://arxiv.org/abs/<id>`在线访问。 ## 数据集规模 * `proofs`配置解压后大小为3094779182字节,包含3681893条示例。 * `sentences`配置解压后大小为3545309822字节,包含38899132条示例。 * `tags`配置解压后大小为7967839字节,包含328642条记录。 * `raw`配置解压后大小为3178997379字节,包含3681903条示例。 ## 数据集统计信息 * `sentences`配置中语句的平均长度为14.1个词。 * `proofs`配置中证明文本的平均长度为10.5条语句。 ## 数据集使用方式 数据集可作为(压缩)TSV文件下载。也可通过`Datasets`库以编程方式从Python中访问该数据集。例如,若要打印前10条证明文本: python from datasets import load_dataset dataset = load_dataset('proofcheck/prooflang', 'proofs', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['proof']) 若要查看证明中的单条语句: python dataset = load_dataset('proofcheck/prooflang', 'proofs', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['sentence']) 若要获取每篇论文的逗号分隔格式arXiv学科标签列表: python from datasets import load_dataset dataset = load_dataset('proofcheck/prooflang', 'tags', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['tags']) 最后,若要查看清洗程度较低的证明文本版本(直接来自LaTeX提取结果): python from datasets import load_dataset dataset = load_dataset('proofcheck/prooflang', 'raw', split='train', streaming='True') for d in dataset.take(10): print(d['paper'], d['proof']) ### 数据划分 目前尚未划分训练集与测试集,所有数据均位于`train`拆分中。 ## 数据集构建过程 我们以1992年至2022年4月间提交至arXiv.org的160万篇论文的LaTeX源代码作为初始数据源。我们通过模拟LaTeX部分功能的Python脚本提取证明文本(包括宏的定义与展开)。该脚本不执行实际的排版操作,仅保留`egin{proof}...end{proof}`环境内的内容,并跳过数学内容。提取过程中: * 以`$`、`egin{equation}`等标记的数学模式公式将被替换为`MATH`。 * ` ef{...}`及其变体(如`autoref`、`subref`等)将被替换为`REF`。 * `cite{...}`及其变体(如`Citet`、`shortciteNP`等)将被替换为`CITE`。 * 疑似专有名词的词汇将被替换为`NAME`。 * `item`将被替换为`CASE:`。 随后,我们对提取出的证明文本执行清洗操作,具体包括: * 修复常见的提取错误(如因未解析宏导致的问题)。 * 将更多引用替换为`REF`,例如`Theorem 2(a)`或`Postulate (*)`。 * 将更多引用文献标记替换为`CITE`,例如`Page 47 of CITE`。 * 将更多证明分情况标记替换为`CASE:`,例如`Case (a).`。 * 修正若干常见拼写错误。 ## 附加信息 本数据集采用知识共享署名4.0(Creative Commons Attribution 4.0)许可协议发布。 证明文本本身的版权归arXiv.org上论文的作者所有,但这些简化后的片段属于美国版权法规定的合理使用范围。
提供机构:
ReactorJet
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作