uw-math-ai/theorem-search-dataset
收藏Hugging Face2026-02-20 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/uw-math-ai/theorem-search-dataset
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-sa-4.0
task_categories:
- question-answering
- feature-extraction
- sentence-similarity
language:
- en
tags:
- math
- theorems
- semantic-search
- information-retrieval
- LaTeX
- arXiv
size_categories:
- 1M<n<10M
pretty_name: Theorem Search Dataset
arxiv: 2602.05216
configs:
- config_name: theorem
data_files:
- split: train
path: "theorem.parquet"
- config_name: theorem_slogan
data_files:
- split: train
path: "theorem_slogan.parquet"
- config_name: paper
data_files:
- split: train
path: "paper.parquet"
- config_name: theorem-test
data_files:
- split: train
path: "theorems-test.parquet"
---
# Theorem Search Dataset
The **largest** open corpus of informal mathematical theorems: **1,341,083 theorem statements** with natural-language slogans from **209,777 papers**, designed for semantic theorem retrieval.
**Paper:** [Semantic Search over 9 Million Mathematical Theorems](https://arxiv.org/abs/2602.05216)
**Demo:** [huggingface.co/spaces/uw-math-ai/theorem-search](https://huggingface.co/spaces/uw-math-ai/theorem-search)
## Benchmark results
On 110 test queries written by research mathematicians, our best pipeline (Qwen3-Embedding-8B on DeepSeek-V3.1 slogans) outperforms all existing tools for finding theorems in the literature:
| Method | Theorem Hit@20 | Paper Hit@20 |
|---|---|---|
| arXiv Search | -- | 2.7% |
| Google Search (`site:arxiv.org`) | -- | 37.8% |
| ChatGPT 5.2 w/ Search | 19.8% | -- |
| Gemini 3 Pro | 27.0% | -- |
| **Ours (Qwen3 8B)** | **45.0%** | **56.8%** |
## Why this dataset?
Mathematicians and AI proof assistants need to find whether a specific result already exists in the literature. Existing tools (Google Scholar, arXiv search, LLMs with web access) operate at the *paper* level, forcing users to manually scan documents for individual statements. This dataset enables **theorem-level** semantic search by pairing every theorem with a concise natural-language slogan.
## Dataset summary
| | Count |
|---|---|
| Theorems | 1,341,083 |
| Papers | 209,777 |
| Slogans | 1,341,083 |
| Test queries (human-written) | 110 |
**Sources** (by number of theorems): arXiv (1,302,109), ProofWiki (23,871), Stacks Project (12,693), Open Logic Project (745), CRing Project (546), Stacks and Moduli (506), HoTT Book (382), An Infinitely Large Napkin (231).
> **Note on dataset size:** The full corpus described in our paper contains over **9.2 million** theorems from approximately 690,000 arXiv papers. However, only ~15% of arXiv papers use permissive licenses (CC BY, CC BY-SA, or CC0). The remaining ~85% are submitted under arXiv's non-exclusive perpetual license, which does not permit redistribution. This dataset releases only the permissively-licensed subset. The full 9.2M-theorem corpus powers the [live demo](https://huggingface.co/spaces/uw-math-ai/theorem-search).
## Files
| File | Rows | Size | Description |
|---|---|---|---|
| `paper.parquet` | 209,777 | 145 MB | Paper metadata (title, authors, abstract, arXiv categories, citations, etc.) |
| `theorem.parquet` | 1,341,083 | 265 MB | Theorem statements in LaTeX (name, body, label, parsing method) |
| `theorem_slogan.parquet` | 1,341,083 | 192 MB | Natural-language slogans generated by DeepSeek-V3.1 |
| `theorems-test.parquet` | 110 | 12 KB | Human-written test queries with ground-truth theorem matches |
## Schema
### `paper.parquet`
| Column | Type | Description |
|---|---|---|
| `paper_id` | string | Unique paper identifier (arXiv ID or source-specific tag) |
| `title` | string | Paper title |
| `authors` | string | List of authors |
| `link` | string | URL to the paper |
| `last_updated` | string | Last update date (arXiv papers) |
| `summary` | string | Paper abstract |
| `journal_ref` | string | Journal reference, if published |
| `primary_category` | string | Primary arXiv category (e.g., `math.AG`) |
| `categories` | string | All arXiv categories |
| `citations` | int | Citation count (where available) |
| `source` | string | Provenance: `arXiv`, `Stacks Project`, `ProofWiki`, etc. |
| `license` | string | License of the source paper |
### `theorem.parquet`
| Column | Type | Description |
|---|---|---|
| `theorem_id` | int | Unique theorem identifier |
| `paper_id` | string | Foreign key to `paper.parquet` |
| `name` | string | Theorem name as displayed in source (e.g., "Theorem 3.1") |
| `body` | string | Full theorem statement in LaTeX |
| `label` | string | LaTeX `\label{}` tag from the source |
| `link` | string | Direct URL to the theorem (where available) |
| `parsing_method` | string | How the theorem was extracted: `plastex`, `tex`, `regex`, or `manual` |
### `theorem_slogan.parquet`
| Column | Type | Description |
|---|---|---|
| `slogan_id` | int | Unique slogan identifier |
| `theorem_id` | int | Foreign key to `theorem.parquet` |
| `model` | string | LLM used to generate the slogan (`DeepSeek-V3.1`) |
| `prompt_id` | string | Prompt variant used (`body-only-v1`) |
| `slogan` | string | Natural-language summary of the theorem |
### `theorems-test.parquet`
| Column | Type | Description |
|---|---|---|
| `query` | string | Natural-language query written blind by a research mathematician |
| `theorem number` | string | Ground-truth theorem identifier (e.g., "Theorem 3.1") |
| `paper title` | string | Title of the paper containing the target theorem |
| `link to paper on arxiv` | string | arXiv URL of the target paper |
## Example
A real entry from the dataset ([Stacks Project, Lemma 10.52.3](https://stacks.math.columbia.edu/tag/00IV)):
> **Theorem body (LaTeX):**
> If $0 \to M' \to M \to M'' \to 0$ is a short exact sequence of modules over $R$ then the length of $M$ is the sum of the lengths of $M'$ and $M''$.
>
> **Generated slogan:**
> Length is additive in short exact sequences. For a short exact sequence of modules from zero to M prime to M to M double prime to zero, the length of M equals the sum of the lengths of M prime and M double prime.
## How theorems were extracted
Theorems were parsed from LaTeX sources using three strategies, applied in fallback order:
1. **plasTeX** (956,194 theorems): LaTeX sources are converted into a structured node tree using [plasTeX](https://github.com/plastex/plastex). Theorem environments are identified and their name, number, and body extracted from node metadata.
2. **TeX logging** (296,239 theorems): A custom LaTeX package is injected that logs theorem data during compilation, capturing the theorem type, number, and body.
3. **Regex** (88,607 theorems): Regular expressions identify theorem delimiter tokens (`\begin{theorem}...\end{theorem}`, `\proclaim...\endproclaim`, etc.) and extract the body.
The remaining 43 theorems were entered manually. Author-defined macros (e.g., `\R` for `\mathbb{R}`) are expanded in theorem bodies. Malformed extractions (e.g., truncated bodies shorter than 8 characters) are filtered out.
## How slogans were generated
Each theorem body is passed to **DeepSeek-V3.1** with a prompt instructing the model to produce a concise, declarative English description of the theorem's main result, avoiding symbolic notation, proof details, and references to the surrounding document. Temperature is fixed at 0.2 with a maximum of 1024 output tokens.
## Test set
The test set contains **110 queries** written by four research mathematicians (Giovanni Inchiostro, Dori Bejleri, Michele Pernice, Ignacio Tejeda) across 14 arXiv tags, primarily Algebraic Geometry (`math.AG`), Analysis (`math.AP`), and Geometric Measure Theory (`math.CA`, `math.MG`).
## Quick start
```python
# Using Hugging Face datasets
from datasets import load_dataset
ds = load_dataset("uw-math-ai/theorem-search-dataset")
```
## API
TheoremSearch provides a production REST API for semantic theorem search.
**Endpoint**
```
POST [https://api.theoremsearch.com/search](https://api.theoremsearch.com/search)
````
**Example**
```bash
curl https://api.theoremsearch.com/search \
-H "Content-Type: application/json" \
-d '{
"query": "smooth DM stack codimension one",
"n_results": 5
}'
````
Returns a JSON object containing theorem-level results with metadata and similarity scores.
## MCP (Model Context Protocol)
TheoremSearch is also available as an MCP tool for AI agents.
**Endpoint**
```
https://api.theoremsearch.com/mcp
```
The MCP server exposes a single tool:
* `theorem_search` — semantic theorem search with optional filters.
### Using with Claude Code
Add the following to your Claude MCP configuration:
```json
{
"mcpServers": {
"theoremsearch": {
"url": "https://api.theoremsearch.com/mcp"
}
}
}
```
## Citation
```bibtex
@article{theoremsearch2026,
title = {Semantic Search over 9 Million Mathematical Theorems},
author = {Alexander, Luke and Leonen, Eric and Szeto, Sophie and Remizov, Artemii and Tejeda, Ignacio and Inchiostro, Giovanni and Ilin, Vasily},
journal = {arXiv preprint},
year = {2026},
}
```
## Contact
For questions or issues, contact [Vasily Ilin](https://vilin97.github.io/).
提供机构:
uw-math-ai



