DeepTheorem

Name: DeepTheorem
Creator: maas
Published: 2025-12-04 16:37:05
License: 暂无描述

魔搭社区2025-12-04 更新2025-06-07 收录

下载链接：

https://modelscope.cn/datasets/AI-ModelScope/DeepTheorem

下载链接

链接失效反馈

官方服务：

资源简介：

# DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning 🚀 Welcome to the GitHub repository for **DeepTheorem** 🎉, a comprehensive framework for enhancing large language model (LLM) mathematical reasoning through informal, natural language-based theorem proving. This project introduces a novel approach to automated theorem proving (ATP) by leveraging the informal reasoning strengths of LLMs, moving beyond traditional formal proof systems 🌟. More info can be found in the paper [DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning](https://huggingface.co/papers/2505.23754) ## Overview 📖 <p align="center"> <img src="frontpage.png" width="800" /> </p> Theorem proving is a critical benchmark for evaluating complex reasoning in LLMs 🧠. However, formal proof systems often misalign with the informal, natural language knowledge LLMs acquire during pre-training. DeepTheorem addresses this gap by introducing: 1. **A Large-Scale Theorem Proving Dataset** 📊: - Contains **high-quality, IMO-level informal theorems and proofs** across diverse mathematical domains 📚. - Rigorously annotated for correctness, difficulty, and topic categories ✅. - Includes systematically constructed **verifiable theorem variants** for robust evaluation 🔍. 2. **RL-Zero** 🤖: - A novel **reinforcement learning strategy** tailored for informal theorem proving ⚙️. - Utilizes verified theorem variants to incentivize robust mathematical inference 💡. Deeptheorem demonstrates its superior performance over both opensource and commercial models (i.e Gemini and o1) ## Performance 🚀 Deeptheorem achieves the #Rank 5 position along all the commerical models and open source models. | **Model** | **FIMO** | | **HMMT** | | **Putnam** | | **Avg.(\#Rank)** | | | :--------------------- | :------: | :-----: | :------: | :-----: | :--------: | :-----: | :------: | :-----: | | | *out.* | *proc.* | *out.* | *proc.* | *out.* | *proc.* | *out.* | *proc.* | | Gemini2\.5-Pro | 57\.14 | 54\.06 | 57\.63 | 49\.82 | 64\.58 | 58\.75 | 59\.78(\#2) | 54\.21(\#3) | | o1-mini | 60\.32 | 55\.23 | 35\.59 | 30\.90 | 61\.46 | 52\.88 | 52\.46(\#4) | 46\.34(\#4) | | o1 | 66\.67 | 61\.00 | 47\.46 | 47\.30 | 62\.50 | 57\.55 | 58\.88(\#3) | 55\.28(\#2) | | o3-mini | 80\.95 | 77\.61 | 45\.76 | 43\.47 | 78\.12 | 75\.12 | 68\.28(\#1) | 65\.40(\#1) | | *[DeepTheorem-RL-7B](https://huggingface.co/Jiahao004/DeepTheorem-qwen-7b-rl) | 55\.56 | 39\.07 | 28\.81 | 20\.85 | 57\.29 | 42\.20 | 47\.22(\#5) | 34\.04(\#5) | | *[DeepTheorem-RL-3B](https://huggingface.co/Jiahao004/DeepTheorem-qwen-3b-rl) | 38\.10 | 23\.39 | 25\.42 | 13\.56 | 52\.08 | 33\.84 | 38\.53 | 23.60 | | *[DeepTheorem-RL-1.5B](https://huggingface.co/Jiahao004/DeepTheorem-qwen-1.5b-rl) | 31\.75 | 15\.23 | 23\.73 | 10\.15 | 52\.08 | 22\.79 | 35\.85 | 16.06 | **Testing:** The testing set is available at [Jiahao004/HMMT_FIMO_Putnam](https://huggingface.co/datasets/Jiahao004/HMMT_FIMO_Putnam). Welcome to try and test your own models with our dataset! ## DeepTheorem Dataset 📊 The DeepTheorem dataset comprises **121K IMO-level informal theorems and proofs** spanning diverse mathematical domains 🌐. Each theorem-proof pair is rigorously annotated for: - **o3-mini Proofs** 🖋️: Ensuring mathematical accuracy through proofs generated or verified by the o3-mini model ✅. - **Truth Value** 🔍: The truth value of the theorem extracted from the o3-mini proofs, indicating whether the theorem is true or false ✔️. - **Difficulty** 🎚️: Categorized by complexity to suit various LLM capabilities 🧩. - **Topic Categories** 🗂️: Covering algebra, geometry, number theory, and more 📘. - **Variants** 🔄: Positive or negative variants of the theorem that share the same or inverse truth value of the original theorem 🔀.

# DeepTheorem：基于自然语言与强化学习推进定理证明的大语言模型推理 🚀 🎉 欢迎来到**DeepTheorem**的GitHub仓库：这是一套通过非形式化、基于自然语言的定理证明来增强大语言模型（Large Language Model，LLM）数学推理能力的综合性框架。本项目通过利用LLM的非形式化推理优势，提出了一种全新的自动化定理证明（Automated Theorem Proving，ATP）方案，突破了传统形式化证明系统的局限🌟。更多详情可查阅论文[DeepTheorem：基于自然语言与强化学习推进定理证明的大语言模型推理](https://huggingface.co/papers/2505.23754) ## 项目概览 📖 <p align="center"> <img src="frontpage.png" width="800" /> </p> 定理证明是评估LLM复杂推理能力的关键基准🧠。然而，传统形式化证明系统往往与LLM在预训练阶段习得的非形式化自然语言知识存在错位。DeepTheorem正是针对这一空白提出了以下解决方案： 1. **大规模定理证明数据集** 📊： - 涵盖多个数学领域的高质量国际数学奥林匹克（International Mathematical Olympiad，IMO）级非形式化定理与证明📚。 - 针对正确性、难度与主题类别进行了严格标注✅。 - 包含系统构建的**可验证定理变体**，用于稳健的模型评估🔍。 2. **RL-Zero** 🤖： - 一种专为非形式化定理证明设计的全新**强化学习（Reinforcement Learning）策略**⚙️。 - 利用已验证的定理变体来激励稳健的数学推理💡。 DeepTheorem在开源与商用模型（如Gemini与o1）上均展现出更优异的性能。 ## 模型性能 🚀 DeepTheorem在商用与开源模型榜单中均位列第5名。 | **模型** | **FIMO** | | **HMMT** | | **普特南（Putnam）** | | **平均得分（#排名）** | | | :--------------------- | :------: | :-----: | :------: | :-----: | :--------: | :-----: | :------: | :-----: | | | *输出结果* | *处理后结果* | *输出结果* | *处理后结果* | *输出结果* | *处理后结果* | *输出结果* | *处理后结果* | | Gemini2.5-Pro | 57.14 | 54.06 | 57.63 | 49.82 | 64.58 | 58.75 | 59.78（#2） | 54.21（#3） | | o1-mini | 60.32 | 55.23 | 35.59 | 30.90 | 61.46 | 52.88 | 52.46（#4） | 46.34（#4） | | o1 | 66.67 | 61.00 | 47.46 | 47.30 | 62.50 | 57.55 | 58.88（#3） | 55.28（#2） | | o3-mini | 80.95 | 77.61 | 45.76 | 43.47 | 78.12 | 75.12 | 68.28（#1） | 65.40（#1） | | *[DeepTheorem-RL-7B](https://huggingface.co/Jiahao004/DeepTheorem-qwen-7b-rl) | 55.56 | 39.07 | 28.81 | 20.85 | 57.29 | 42.20 | 47.22（#5） | 34.04（#5） | | *[DeepTheorem-RL-3B](https://huggingface.co/Jiahao004/DeepTheorem-qwen-3b-rl) | 38.10 | 23.39 | 25.42 | 13.56 | 52.08 | 33.84 | 38.53 | 23.60 | | *[DeepTheorem-RL-1.5B](https://huggingface.co/Jiahao004/DeepTheorem-qwen-1.5b-rl) | 31.75 | 15.23 | 23.73 | 10.15 | 52.08 | 22.79 | 35.85 | 16.06 | **测试集**：测试集可在[Jiahao004/HMMT_FIMO_Putnam](https://huggingface.co/datasets/Jiahao004/HMMT_FIMO_Putnam)获取。欢迎使用本数据集测试您自己的模型！ ## DeepTheorem 数据集 📊 DeepTheorem数据集包含**12.1万条IMO级非形式化定理与证明**，覆盖多个数学领域🌐。每一组定理-证明对均针对以下维度进行了严格标注： - **o3-mini 证明** 🖋️：通过o3-mini模型生成或验证的证明来确保数学正确性✅。 - **真值** 🔍：从o3-mini证明中提取的定理真值，用于指示该定理的真伪✔️。 - **难度等级** 🎚️：按复杂度进行分类，以适配不同能力的LLM🧩。 - **主题类别** 🗂️：涵盖代数、几何、数论等多个领域📘。 - **定理变体** 🔄：与原定理具有相同或相反真值的正/负定理变体🔀。

提供机构：

maas

创建时间：

2025-06-05

5,000+

优质数据集

54 个

任务类型

进入经典数据集