Kimina-Prover-Promptset

Name: Kimina-Prover-Promptset
Creator: maas
Published: 2026-01-06 16:42:44
License: 暂无描述

魔搭社区2026-01-06 更新2025-09-27 收录

下载链接：

https://modelscope.cn/datasets/AI-MO/Kimina-Prover-Promptset

下载链接

链接失效反馈

官方服务：

资源简介：

# Kimina-Prover-Promptset Kimina-Prover-Promptset is a curated subset of [NuminaMath-LEAN](https://huggingface.co/datasets/AI-MO/NuminaMath-LEAN), designed for reinforcement learning (RL) training of formal theorem provers in Lean 4. Compared to the full dataset, this subset contains **fewer problems but with higher difficulty**. [NuminaMath-LEAN](https://huggingface.co/datasets/AI-MO/NuminaMath-LEAN) is filtered and preprocessed as follows to create this dataset: - Remove easy problems with a historical win rate above 0.5 to only keeep challenging statements in the dataset. - Generate variants of existing problems to increase diversity using Gemini - Duplicate hard problems to give them more weight during training The resulting dataset contains challenging, high-value problems for improving Lean 4 theorem proving models. This dataset was used to train **[AI-MO/Kimina-Prover-RL-1.7B](https://huggingface.co/AI-MO/Kimina-Prover-RL-1.7B)**, a 1.7B parameter Lean 4 prover reaching **76.63% Pass@32** on the MiniF2F benchmark. ## Dataset Structure | Column | Type | Description | |--------|------|-------------| | `statement_id` | `string` | Unique identifier of the formal statement. | | `natural_language` | `string` | Natural language statement of the problem. | | `formal_statement` | `string` | Lean 4 formal statement corresponding to the problem. | | `source` | `string` | Origin of the statement. `"synthetic"` indicates it was generated by our pipeline. | | `name` | `string` | Name of the problem. | ## Licensing Information The dataset is available under the Apache License, Version 2.0. ## Citation Information ``` @article{kimina_prover_2025, title = {Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning}, author = {Wang, Haiming and Unsal, Mert and Lin, Xiaohan and Baksys, Mantas and Liu, Junqi and Santos, Marco Dos and Sung, Flood and Vinyes, Marina and Ying, Zhenzhe and Zhu, Zekai and Lu, Jianqiao and Saxcé, Hugues de and Bailey, Bolton and Song, Chendong and Xiao, Chenjun and Zhang, Dehao and Zhang, Ebony and Pu, Frederick and Zhu, Han and Liu, Jiawei and Bayer, Jonas and Michel, Julien and Yu, Longhui and Dreyfus-Schmidt, Léo and Tunstall, Lewis and Pagani, Luigi and Machado, Moreira and Bourigault, Pauline and Wang, Ran and Polu, Stanislas and Barroyer, Thibaut and Li, Wen-Ding and Niu, Yazhe and Fleureau, Yann and Hu, Yangyang and Yu, Zhouliang and Wang, Zihan and Yang, Zhilin and Liu, Zhengying and Li, Jia}, year = {2025}, url = {http://arxiv.org/abs/2504.11354}, } ```

# Kimina-Prover-Promptset Kimina-Prover-Promptset 是精选自 [NuminaMath-LEAN](https://huggingface.co/datasets/AI-MO/NuminaMath-LEAN) 的子集，专为 Lean 4 环境下的形式化定理证明器的强化学习（reinforcement learning, RL）训练打造。相较于完整数据集，该子集的问题体量更小，但整体难度更高。 [NuminaMath-LEAN](https://huggingface.co/datasets/AI-MO/NuminaMath-LEAN) 经以下筛选与预处理流程后构建为本数据集： - 移除历史胜率超过0.5的简易问题，仅保留具有挑战性的命题； - 借助 Gemini 生成现有问题的变体，以扩充数据集的多样性； - 对高难度问题进行重复采样，使其在训练阶段获得更高的权重。最终产出的数据集包含兼具挑战性与高研究价值的命题，用于优化 Lean 4 定理证明模型。本数据集曾用于训练 **[AI-MO/Kimina-Prover-RL-1.7B](https://huggingface.co/AI-MO/Kimina-Prover-RL-1.7B)**——一款参数量达17亿的 Lean 4 证明器，在 MiniF2F 基准测试中取得了76.63%的 Pass@32 指标。 ## 数据集结构 | 列名 | 数据类型 | 描述 | |--------|------|-------------| | `statement_id` | `string` | 形式化命题的唯一标识符。 | | `natural_language` | `string` | 问题的自然语言表述。 | | `formal_statement` | `string` | 对应问题的 Lean 4 形式化语句。 | | `source` | `string` | 命题的来源。若字段值为 `"synthetic"`，则表示该命题由我们的流水线生成。 | | `name` | `string` | 问题的名称。 | ## 许可信息本数据集采用 Apache License, Version 2.0 协议发布。 ## 引用信息 @article{kimina_prover_2025, title = {Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning}, author = {Wang, Haiming and Unsal, Mert and Lin, Xiaohan and Baksys, Mantas and Liu, Junqi and Santos, Marco Dos and Sung, Flood and Vinyes, Marina and Ying, Zhenzhe and Zhu, Zekai and Lu, Jianqiao and Saxcé, Hugues de and Bailey, Bolton and Song, Chendong and Xiao, Chenjun and Zhang, Dehao and Zhang, Ebony and Pu, Frederick and Zhu, Han and Liu, Jiawei and Bayer, Jonas and Michel, Julien and Yu, Longhui and Dreyfus-Schmidt, Léo and Tunstall, Lewis and Pagani, Luigi and Machado, Moreira and Bourigault, Pauline and Wang, Ran and Polu, Stanislas and Barroyer, Thibaut and Li, Wen-Ding and Niu, Yazhe and Fleureau, Yann and Hu, Yangyang and Yu, Zhouliang and Wang, Zihan and Yang, Zhilin and Liu, Zhengying and Li, Jia}, year = {2025}, url = {http://arxiv.org/abs/2504.11354}, }

提供机构：

maas

创建时间：

2025-08-15

5,000+

优质数据集

54 个

任务类型

进入经典数据集