uw-math-ai/theorem-search-dataset

Name: uw-math-ai/theorem-search-dataset
Creator: uw-math-ai
Published: 2026-02-20 00:51:18
License: 暂无描述

Hugging Face2026-02-20 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/uw-math-ai/theorem-search-dataset

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: cc-by-sa-4.0 task_categories: - question-answering - feature-extraction - sentence-similarity language: - en tags: - math - theorems - semantic-search - information-retrieval - LaTeX - arXiv size_categories: - 1M<n<10M pretty_name: Theorem Search Dataset arxiv: 2602.05216 configs: - config_name: theorem data_files: - split: train path: "theorem.parquet" - config_name: theorem_slogan data_files: - split: train path: "theorem_slogan.parquet" - config_name: paper data_files: - split: train path: "paper.parquet" - config_name: theorem-test data_files: - split: train path: "theorems-test.parquet" --- # Theorem Search Dataset The **largest** open corpus of informal mathematical theorems: **1,341,083 theorem statements** with natural-language slogans from **209,777 papers**, designed for semantic theorem retrieval. **Paper:** [Semantic Search over 9 Million Mathematical Theorems](https://arxiv.org/abs/2602.05216) **Demo:** [huggingface.co/spaces/uw-math-ai/theorem-search](https://huggingface.co/spaces/uw-math-ai/theorem-search) ## Benchmark results On 110 test queries written by research mathematicians, our best pipeline (Qwen3-Embedding-8B on DeepSeek-V3.1 slogans) outperforms all existing tools for finding theorems in the literature: | Method | Theorem Hit@20 | Paper Hit@20 | |---|---|---| | arXiv Search | -- | 2.7% | | Google Search (`site:arxiv.org`) | -- | 37.8% | | ChatGPT 5.2 w/ Search | 19.8% | -- | | Gemini 3 Pro | 27.0% | -- | | **Ours (Qwen3 8B)** | **45.0%** | **56.8%** | ## Why this dataset? Mathematicians and AI proof assistants need to find whether a specific result already exists in the literature. Existing tools (Google Scholar, arXiv search, LLMs with web access) operate at the *paper* level, forcing users to manually scan documents for individual statements. This dataset enables **theorem-level** semantic search by pairing every theorem with a concise natural-language slogan. ## Dataset summary | | Count | |---|---| | Theorems | 1,341,083 | | Papers | 209,777 | | Slogans | 1,341,083 | | Test queries (human-written) | 110 | **Sources** (by number of theorems): arXiv (1,302,109), ProofWiki (23,871), Stacks Project (12,693), Open Logic Project (745), CRing Project (546), Stacks and Moduli (506), HoTT Book (382), An Infinitely Large Napkin (231). > **Note on dataset size:** The full corpus described in our paper contains over **9.2 million** theorems from approximately 690,000 arXiv papers. However, only ~15% of arXiv papers use permissive licenses (CC BY, CC BY-SA, or CC0). The remaining ~85% are submitted under arXiv's non-exclusive perpetual license, which does not permit redistribution. This dataset releases only the permissively-licensed subset. The full 9.2M-theorem corpus powers the [live demo](https://huggingface.co/spaces/uw-math-ai/theorem-search). ## Files | File | Rows | Size | Description | |---|---|---|---| | `paper.parquet` | 209,777 | 145 MB | Paper metadata (title, authors, abstract, arXiv categories, citations, etc.) | | `theorem.parquet` | 1,341,083 | 265 MB | Theorem statements in LaTeX (name, body, label, parsing method) | | `theorem_slogan.parquet` | 1,341,083 | 192 MB | Natural-language slogans generated by DeepSeek-V3.1 | | `theorems-test.parquet` | 110 | 12 KB | Human-written test queries with ground-truth theorem matches | ## Schema ### `paper.parquet` | Column | Type | Description | |---|---|---| | `paper_id` | string | Unique paper identifier (arXiv ID or source-specific tag) | | `title` | string | Paper title | | `authors` | string | List of authors | | `link` | string | URL to the paper | | `last_updated` | string | Last update date (arXiv papers) | | `summary` | string | Paper abstract | | `journal_ref` | string | Journal reference, if published | | `primary_category` | string | Primary arXiv category (e.g., `math.AG`) | | `categories` | string | All arXiv categories | | `citations` | int | Citation count (where available) | | `source` | string | Provenance: `arXiv`, `Stacks Project`, `ProofWiki`, etc. | | `license` | string | License of the source paper | ### `theorem.parquet` | Column | Type | Description | |---|---|---| | `theorem_id` | int | Unique theorem identifier | | `paper_id` | string | Foreign key to `paper.parquet` | | `name` | string | Theorem name as displayed in source (e.g., "Theorem 3.1") | | `body` | string | Full theorem statement in LaTeX | | `label` | string | LaTeX `\label{}` tag from the source | | `link` | string | Direct URL to the theorem (where available) | | `parsing_method` | string | How the theorem was extracted: `plastex`, `tex`, `regex`, or `manual` | ### `theorem_slogan.parquet` | Column | Type | Description | |---|---|---| | `slogan_id` | int | Unique slogan identifier | | `theorem_id` | int | Foreign key to `theorem.parquet` | | `model` | string | LLM used to generate the slogan (`DeepSeek-V3.1`) | | `prompt_id` | string | Prompt variant used (`body-only-v1`) | | `slogan` | string | Natural-language summary of the theorem | ### `theorems-test.parquet` | Column | Type | Description | |---|---|---| | `query` | string | Natural-language query written blind by a research mathematician | | `theorem number` | string | Ground-truth theorem identifier (e.g., "Theorem 3.1") | | `paper title` | string | Title of the paper containing the target theorem | | `link to paper on arxiv` | string | arXiv URL of the target paper | ## Example A real entry from the dataset ([Stacks Project, Lemma 10.52.3](https://stacks.math.columbia.edu/tag/00IV)): > **Theorem body (LaTeX):** > If $0 \to M' \to M \to M'' \to 0$ is a short exact sequence of modules over $R$ then the length of $M$ is the sum of the lengths of $M'$ and $M''$. > > **Generated slogan:** > Length is additive in short exact sequences. For a short exact sequence of modules from zero to M prime to M to M double prime to zero, the length of M equals the sum of the lengths of M prime and M double prime. ## How theorems were extracted Theorems were parsed from LaTeX sources using three strategies, applied in fallback order: 1. **plasTeX** (956,194 theorems): LaTeX sources are converted into a structured node tree using [plasTeX](https://github.com/plastex/plastex). Theorem environments are identified and their name, number, and body extracted from node metadata. 2. **TeX logging** (296,239 theorems): A custom LaTeX package is injected that logs theorem data during compilation, capturing the theorem type, number, and body. 3. **Regex** (88,607 theorems): Regular expressions identify theorem delimiter tokens (`\begin{theorem}...\end{theorem}`, `\proclaim...\endproclaim`, etc.) and extract the body. The remaining 43 theorems were entered manually. Author-defined macros (e.g., `\R` for `\mathbb{R}`) are expanded in theorem bodies. Malformed extractions (e.g., truncated bodies shorter than 8 characters) are filtered out. ## How slogans were generated Each theorem body is passed to **DeepSeek-V3.1** with a prompt instructing the model to produce a concise, declarative English description of the theorem's main result, avoiding symbolic notation, proof details, and references to the surrounding document. Temperature is fixed at 0.2 with a maximum of 1024 output tokens. ## Test set The test set contains **110 queries** written by four research mathematicians (Giovanni Inchiostro, Dori Bejleri, Michele Pernice, Ignacio Tejeda) across 14 arXiv tags, primarily Algebraic Geometry (`math.AG`), Analysis (`math.AP`), and Geometric Measure Theory (`math.CA`, `math.MG`). ## Quick start ```python # Using Hugging Face datasets from datasets import load_dataset ds = load_dataset("uw-math-ai/theorem-search-dataset") ``` ## API TheoremSearch provides a production REST API for semantic theorem search. **Endpoint** ``` POST [https://api.theoremsearch.com/search](https://api.theoremsearch.com/search) ```` **Example** ```bash curl https://api.theoremsearch.com/search \ -H "Content-Type: application/json" \ -d '{ "query": "smooth DM stack codimension one", "n_results": 5 }' ```` Returns a JSON object containing theorem-level results with metadata and similarity scores. ## MCP (Model Context Protocol) TheoremSearch is also available as an MCP tool for AI agents. **Endpoint** ``` https://api.theoremsearch.com/mcp ``` The MCP server exposes a single tool: * `theorem_search` — semantic theorem search with optional filters. ### Using with Claude Code Add the following to your Claude MCP configuration: ```json { "mcpServers": { "theoremsearch": { "url": "https://api.theoremsearch.com/mcp" } } } ``` ## Citation ```bibtex @article{theoremsearch2026, title = {Semantic Search over 9 Million Mathematical Theorems}, author = {Alexander, Luke and Leonen, Eric and Szeto, Sophie and Remizov, Artemii and Tejeda, Ignacio and Inchiostro, Giovanni and Ilin, Vasily}, journal = {arXiv preprint}, year = {2026}, } ``` ## Contact For questions or issues, contact [Vasily Ilin](https://vilin97.github.io/).

提供机构：

uw-math-ai

5,000+

优质数据集

54 个

任务类型

进入经典数据集