DeepTheorem
收藏魔搭社区2025-12-04 更新2025-06-07 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/DeepTheorem
下载链接
链接失效反馈官方服务:
资源简介:
# DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning 🚀
Welcome to the GitHub repository for **DeepTheorem** 🎉, a comprehensive framework for enhancing large language model (LLM) mathematical reasoning through informal, natural language-based theorem proving. This project introduces a novel approach to automated theorem proving (ATP) by leveraging the informal reasoning strengths of LLMs, moving beyond traditional formal proof systems 🌟.
More info can be found in the paper [DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning](https://huggingface.co/papers/2505.23754)
## Overview 📖
<p align="center">
<img src="frontpage.png" width="800" />
</p>
Theorem proving is a critical benchmark for evaluating complex reasoning in LLMs 🧠. However, formal proof systems often misalign with the informal, natural language knowledge LLMs acquire during pre-training. DeepTheorem addresses this gap by introducing:
1. **A Large-Scale Theorem Proving Dataset** 📊:
- Contains **high-quality, IMO-level informal theorems and proofs** across diverse mathematical domains 📚.
- Rigorously annotated for correctness, difficulty, and topic categories ✅.
- Includes systematically constructed **verifiable theorem variants** for robust evaluation 🔍.
2. **RL-Zero** 🤖:
- A novel **reinforcement learning strategy** tailored for informal theorem proving ⚙️.
- Utilizes verified theorem variants to incentivize robust mathematical inference 💡.
Deeptheorem demonstrates its superior performance over both opensource and commercial models (i.e Gemini and o1)
## Performance 🚀
Deeptheorem achieves the #Rank 5 position along all the commerical models and open source models.
| **Model** | **FIMO** | | **HMMT** | | **Putnam** | | **Avg.(\#Rank)** | |
| :--------------------- | :------: | :-----: | :------: | :-----: | :--------: | :-----: | :------: | :-----: |
| | *out.* | *proc.* | *out.* | *proc.* | *out.* | *proc.* | *out.* | *proc.* |
| Gemini2\.5-Pro | 57\.14 | 54\.06 | 57\.63 | 49\.82 | 64\.58 | 58\.75 | 59\.78(\#2) | 54\.21(\#3) |
| o1-mini | 60\.32 | 55\.23 | 35\.59 | 30\.90 | 61\.46 | 52\.88 | 52\.46(\#4) | 46\.34(\#4) |
| o1 | 66\.67 | 61\.00 | 47\.46 | 47\.30 | 62\.50 | 57\.55 | 58\.88(\#3) | 55\.28(\#2) |
| o3-mini | 80\.95 | 77\.61 | 45\.76 | 43\.47 | 78\.12 | 75\.12 | 68\.28(\#1) | 65\.40(\#1) |
| *[DeepTheorem-RL-7B](https://huggingface.co/Jiahao004/DeepTheorem-qwen-7b-rl) | 55\.56 | 39\.07 | 28\.81 | 20\.85 | 57\.29 | 42\.20 | 47\.22(\#5) | 34\.04(\#5) |
| *[DeepTheorem-RL-3B](https://huggingface.co/Jiahao004/DeepTheorem-qwen-3b-rl) | 38\.10 | 23\.39 | 25\.42 | 13\.56 | 52\.08 | 33\.84 | 38\.53 | 23.60 |
| *[DeepTheorem-RL-1.5B](https://huggingface.co/Jiahao004/DeepTheorem-qwen-1.5b-rl) | 31\.75 | 15\.23 | 23\.73 | 10\.15 | 52\.08 | 22\.79 | 35\.85 | 16.06 |
**Testing:** The testing set is available at [Jiahao004/HMMT_FIMO_Putnam](https://huggingface.co/datasets/Jiahao004/HMMT_FIMO_Putnam). Welcome to try and test your own models with our dataset!
## DeepTheorem Dataset 📊
The DeepTheorem dataset comprises **121K IMO-level informal theorems and proofs** spanning diverse mathematical domains 🌐. Each theorem-proof pair is rigorously annotated for:
- **o3-mini Proofs** 🖋️: Ensuring mathematical accuracy through proofs generated or verified by the o3-mini model ✅.
- **Truth Value** 🔍: The truth value of the theorem extracted from the o3-mini proofs, indicating whether the theorem is true or false ✔️.
- **Difficulty** 🎚️: Categorized by complexity to suit various LLM capabilities 🧩.
- **Topic Categories** 🗂️: Covering algebra, geometry, number theory, and more 📘.
- **Variants** 🔄: Positive or negative variants of the theorem that share the same or inverse truth value of the original theorem 🔀.
# DeepTheorem:基于自然语言与强化学习推进定理证明的大语言模型推理 🚀
🎉 欢迎来到**DeepTheorem**的GitHub仓库:这是一套通过非形式化、基于自然语言的定理证明来增强大语言模型(Large Language Model,LLM)数学推理能力的综合性框架。本项目通过利用LLM的非形式化推理优势,提出了一种全新的自动化定理证明(Automated Theorem Proving,ATP)方案,突破了传统形式化证明系统的局限🌟。
更多详情可查阅论文[DeepTheorem:基于自然语言与强化学习推进定理证明的大语言模型推理](https://huggingface.co/papers/2505.23754)
## 项目概览 📖
<p align="center">
<img src="frontpage.png" width="800" />
</p>
定理证明是评估LLM复杂推理能力的关键基准🧠。然而,传统形式化证明系统往往与LLM在预训练阶段习得的非形式化自然语言知识存在错位。DeepTheorem正是针对这一空白提出了以下解决方案:
1. **大规模定理证明数据集** 📊:
- 涵盖多个数学领域的高质量国际数学奥林匹克(International Mathematical Olympiad,IMO)级非形式化定理与证明📚。
- 针对正确性、难度与主题类别进行了严格标注✅。
- 包含系统构建的**可验证定理变体**,用于稳健的模型评估🔍。
2. **RL-Zero** 🤖:
- 一种专为非形式化定理证明设计的全新**强化学习(Reinforcement Learning)策略**⚙️。
- 利用已验证的定理变体来激励稳健的数学推理💡。
DeepTheorem在开源与商用模型(如Gemini与o1)上均展现出更优异的性能。
## 模型性能 🚀
DeepTheorem在商用与开源模型榜单中均位列第5名。
| **模型** | **FIMO** | | **HMMT** | | **普特南(Putnam)** | | **平均得分(#排名)** | |
| :--------------------- | :------: | :-----: | :------: | :-----: | :--------: | :-----: | :------: | :-----: |
| | *输出结果* | *处理后结果* | *输出结果* | *处理后结果* | *输出结果* | *处理后结果* | *输出结果* | *处理后结果* |
| Gemini2.5-Pro | 57.14 | 54.06 | 57.63 | 49.82 | 64.58 | 58.75 | 59.78(#2) | 54.21(#3) |
| o1-mini | 60.32 | 55.23 | 35.59 | 30.90 | 61.46 | 52.88 | 52.46(#4) | 46.34(#4) |
| o1 | 66.67 | 61.00 | 47.46 | 47.30 | 62.50 | 57.55 | 58.88(#3) | 55.28(#2) |
| o3-mini | 80.95 | 77.61 | 45.76 | 43.47 | 78.12 | 75.12 | 68.28(#1) | 65.40(#1) |
| *[DeepTheorem-RL-7B](https://huggingface.co/Jiahao004/DeepTheorem-qwen-7b-rl) | 55.56 | 39.07 | 28.81 | 20.85 | 57.29 | 42.20 | 47.22(#5) | 34.04(#5) |
| *[DeepTheorem-RL-3B](https://huggingface.co/Jiahao004/DeepTheorem-qwen-3b-rl) | 38.10 | 23.39 | 25.42 | 13.56 | 52.08 | 33.84 | 38.53 | 23.60 |
| *[DeepTheorem-RL-1.5B](https://huggingface.co/Jiahao004/DeepTheorem-qwen-1.5b-rl) | 31.75 | 15.23 | 23.73 | 10.15 | 52.08 | 22.79 | 35.85 | 16.06 |
**测试集**:测试集可在[Jiahao004/HMMT_FIMO_Putnam](https://huggingface.co/datasets/Jiahao004/HMMT_FIMO_Putnam)获取。欢迎使用本数据集测试您自己的模型!
## DeepTheorem 数据集 📊
DeepTheorem数据集包含**12.1万条IMO级非形式化定理与证明**,覆盖多个数学领域🌐。每一组定理-证明对均针对以下维度进行了严格标注:
- **o3-mini 证明** 🖋️:通过o3-mini模型生成或验证的证明来确保数学正确性✅。
- **真值** 🔍:从o3-mini证明中提取的定理真值,用于指示该定理的真伪✔️。
- **难度等级** 🎚️:按复杂度进行分类,以适配不同能力的LLM🧩。
- **主题类别** 🗂️:涵盖代数、几何、数论等多个领域📘。
- **定理变体** 🔄:与原定理具有相同或相反真值的正/负定理变体🔀。
提供机构:
maas
创建时间:
2025-06-05



