II-Thought-RL-v0
ModelScope community · Updated 2025-12-04 · Indexed 2025-04-05
Download link:
https://modelscope.cn/datasets/AI-ModelScope/II-Thought-RL-v0
## II-Thought RL v0: A Large-Scale Curated Dataset for Reinforcement Learning

*See our blog [**here**](https://ii.inc/web/blog/post/ii-thought) for additional details.*
We introduce II-Thought RL v0, the first large-scale, multi-task dataset designed for Reinforcement Learning. This dataset consists of high-quality question-answer pairs that have undergone a rigorous multi-step filtering process, leveraging Gemini 2.0 Flash and Qwen 32B as quality evaluators.
In this initial release, we have curated and refined publicly available datasets while also introducing our own high-quality question pairs. Looking ahead, future iterations will focus on less accessible but verifiable domains, such as science, engineering, medicine, and finance. Additionally, we aim to incorporate reasoning traces using R1 to support reasoning distillation for smaller models.
<img src="graph.png" width="700">**Graph:** Data Curation Process
### **Mathematics**
Our mathematics dataset is a deduplicated and curated aggregation of [HARP](https://arxiv.org/abs/2412.08819), [OMNI-Math](https://huggingface.co/datasets/KbsdJames/Omni-MATH), [Numina-Math-CoT](https://huggingface.co/datasets/ai-mo/numinamath-cot), [Numina-Math-1.5](https://huggingface.co/datasets/ai-mo/numinamath-1.5), [DeepScaler](https://huggingface.co/datasets/agentica-org/deepscaler-preview-dataset), and our own set of verifiable IMO Shortlist problems.
- To build our new collection, we collected IMO and IMO Shortlist PDFs and then used [MinerU](https://github.com/opendatalab/mineru?tab=readme-ov-file#mineru) to extract high-quality math expressions as Markdown.
- The Markdown is then fed to Gemini-2.0-Flash in a sliding-window fashion to extract high-quality problem/solution pairs; this ensures that we can extract problems from long PDF files.
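The sliding-window step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `sliding_windows`, the line-based windowing, and the `window`/`overlap` defaults are all assumptions. Each yielded chunk would then be sent to the extraction model, with the overlap keeping a problem that straddles a boundary intact in at least one chunk.

```python
from typing import Iterator, List


def sliding_windows(lines: List[str], window: int = 200, overlap: int = 50) -> Iterator[str]:
    """Yield overlapping chunks of a Markdown document, `window` lines at a time.

    `window` and `overlap` are hypothetical defaults chosen for illustration.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    for start in range(0, len(lines), step):
        yield "\n".join(lines[start:start + window])
        if start + window >= len(lines):
            break  # this window already reached the end of the document


if __name__ == "__main__":
    markdown_lines = [f"line {i}" for i in range(10)]
    for chunk in sliding_windows(markdown_lines, window=4, overlap=2):
        print(repr(chunk))
```

A larger overlap lowers the risk of cutting a problem in half, at the cost of more model calls and more duplicate extractions to merge afterwards.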
To construct the final subset:
- First, we use regex to do a preliminary filtering for the verifiable subset, removing proof, multiple-choice, and multi-part patterns that can be easily filtered.
- We then evaluate the quality of the problems using Gemini 2.0 Flash, keeping only good and excellent problems.
- Finally, following [Big-Math](https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-Verified) we use Qwen 32B to filter out questions unsuitable for RL training, such as proofs, yes/no answers, multiple-choice and multi-part questions (see our technical report for details).
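A regex pre-filter of the kind described in the first step might look like the sketch below. The patterns and the function name `unverifiable_reason` are illustrative assumptions; the actual expressions used for the dataset are not given here, and a real filter would need many more patterns.

```python
import re
from typing import Optional

# Hypothetical patterns, chosen only to illustrate the three categories.
PROOF = re.compile(r"\b(?:prove|show)\s+that\b|\bproof\b", re.IGNORECASE)
MULTIPLE_CHOICE = re.compile(r"\(A\).*\(B\).*\(C\)", re.DOTALL)
MULTI_PART = re.compile(r"\(a\).*\(b\)", re.DOTALL)


def unverifiable_reason(problem: str) -> Optional[str]:
    """Return why a problem is dropped by the pre-filter, or None if it passes."""
    if PROOF.search(problem):
        return "proof"
    if MULTIPLE_CHOICE.search(problem):
        return "multiple-choice"
    if MULTI_PART.search(problem):
        return "multi-part"
    return None


if __name__ == "__main__":
    samples = [
        "Prove that the square root of 2 is irrational.",
        "What is 2 + 2? (A) 3 (B) 4 (C) 5",
        "(a) Find x. (b) Find y.",
        "Find the value of 3x when x = 4.",
    ]
    kept = [s for s in samples if unverifiable_reason(s) is None]
    print(kept)
```

Cheap patterns like these only catch the obvious cases, which is why the later model-based checks are still needed.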
### **Code**
The coding dataset is a deduplicated and curated aggregation of [Apps](https://huggingface.co/datasets/codeparrot/apps), [Taco](https://huggingface.co/datasets/baai/taco) (from [PrimeIntellect/Synthetic1](https://huggingface.co/datasets/primeintellect/synthetic-1)), [Code Contest](https://huggingface.co/datasets/deepmind/code_contests), [Codeforces](https://huggingface.co/datasets/matrixstudio/codeforces-python-submissions), and our own [collection](https://huggingface.co/datasets/intelligent-internet/acm-icpc-rl-v0) of 20 years of ICPC and regional coding contest problems.
- The ICPC problems were extracted from ICPC exam PDFs using Gemini-2.0-Flash in a sliding-window fashion, separating high-quality problems, solutions, and test cases.
- We first removed all problems with no test cases, and then evaluated the quality of the problems using Gemini 2.0 Flash, keeping only good and excellent problems.
- We then used Qwen 32B as a final quality check, removing all problems that have bad formatting or contain figures that are essential for the solution.
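The test-case filter is the simplest of these steps to illustrate. The sketch below assumes a hypothetical record layout in which each problem is a dict with a `test_cases` list of `{"input": str, "output": str}` entries; the real schema of the dataset may differ.

```python
from typing import Dict, List


def has_usable_tests(problem: Dict) -> bool:
    """True if the problem has at least one test case with a non-empty expected output."""
    tests = problem.get("test_cases") or []
    return any((t.get("output") or "").strip() for t in tests)


def drop_untestable(problems: List[Dict]) -> List[Dict]:
    """Keep only problems whose answers can be checked against test cases."""
    return [p for p in problems if has_usable_tests(p)]


if __name__ == "__main__":
    problems = [
        {"id": "p1", "test_cases": [{"input": "1 2\n", "output": "3\n"}]},
        {"id": "p2", "test_cases": []},
        {"id": "p3"},
    ]
    print([p["id"] for p in drop_untestable(problems)])
```

Without at least one expected output there is nothing to verify a generated solution against, so such problems cannot provide a reward signal.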
### **Science**
Our science dataset includes a verifiable subset of Camel [Physics](https://huggingface.co/datasets/camel-ai/physics), [Chemistry](https://huggingface.co/datasets/camel-ai/chemistry) and [Biology](https://huggingface.co/datasets/camel-ai/biology), primarily consisting of problems with numerical answers.
Additionally, we introduce 13,000 curated question-answer pairs sourced from publicly available and verifiable scientific content.
### **Other**
To cover more domains, our dataset also includes the following sources:
- [FreedomIntelligence/medical-o1-verifiable-problem](https://huggingface.co/datasets/freedomintelligence/medical-o1-reasoning-sft)
- [INK-USC/riddle_sense](https://huggingface.co/datasets/INK-USC/riddle_sense)
- A small subset of [GeneralReasoning/GeneralThought-Feb25](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-Feb25)
Each subset follows our multi-step filtering approach to maintain high quality and RL suitability. We are working on adding more domains in the next iteration.
Finally, the combined dataset goes through near-match deduplication and then our strict decontamination pipeline, ensuring data integrity in training. The table below gives statistics on the contaminated problems.
| Dataset | MATH500 | AIME2024 | AIME2025 | LiveCodeBench | Gaokao-En | Olympiad Bench | AMC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AI-MO/NuminaMath-CoT | 8104/1 | 0 | 5 | 0 | 792/1 | 491/2 | 47 |
| AI-MO/NuminaMath-1.5 | 6154/3 | 48/15 | 10/0 | 0 | 601/0 | 854/7 | 68 |
| agentica-org/DeepScaleR-Preview-Dataset | 627/1 | 0 | 2 | 0 | 75/1 | 77 | 4 |
| Intelligent-Internet/ICPC-RL-v2-formatted | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| PrimeIntellect/SYNTHETIC-1 | 69 | 0 | 0 | 0 | 4 | 119 | 0 |
**Table 1:** Problems removed as a result of data contamination.
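The exact decontamination method is not described here. One common approach, shown below purely as an assumed stand-in, flags a training problem when a large share of its word n-grams also appear in a benchmark problem. The function names, `n = 8`, and the `0.5` threshold are illustrative choices, not values from the pipeline.

```python
import re
from typing import Iterable, List, Set, Tuple


def normalize(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of all contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(problem: str, benchmark: Iterable[str],
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag `problem` if it shares enough n-grams with any benchmark problem."""
    grams = ngrams(normalize(problem), n)
    if not grams:
        return False  # too short to compare at this n
    for reference in benchmark:
        ref_grams = ngrams(normalize(reference), n)
        if not ref_grams:
            continue
        overlap = len(grams & ref_grams) / min(len(grams), len(ref_grams))
        if overlap >= threshold:
            return True
    return False


if __name__ == "__main__":
    benchmark = ["Find the number of positive integers n less than 1000 "
                 "such that n is divisible by 7."]
    print(is_contaminated("Find the number of positive integers n less than "
                          "1000 such that n is divisible by 7!", benchmark))
```

The same overlap score, applied between pairs of training problems instead of against a benchmark, gives a basic near-match deduplication check.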
Finally, we obtain our first iteration of II-Thought:
| Dataset | Domain | Source | Samples |
|-----------------------------------|---------|---------------------------------------------------------------|---------:|
| NuminaMath-1.5 | Math | AI-MO/NuminaMath-1.5 | 123442 |
| Real World SWE | Code | primeintellect/real-world-swe-problems | 69176 |
| Mix-Math | Math | AI-MO/NuminaMath-CoT, OmniMath, HARP, IMO-ShortList | 53532 |
| medical-o1-verifiable-problem | Medical | FreedomIntelligence/medical-o1-verifiable-problem | 38986 |
| DeepScaler | Math | agentica-org/DeepScaleR-Preview-Dataset | 12573 |
| OpenTextBook | Science | crawl/text_book | 10593 |
| GeneralThought-Feb25 | Reasoning | GeneralReasoning/GeneralThought-Feb25 | 9075 |
| Code Contest | Code | deepmind/code_contests | 8937 |
| Apps & Taco | Code | PrimeIntellect/SYNTHETIC-1 | 7450 |
| riddle_sense | Riddle | ink-usc/riddle_sense | 3454 |
| Python Codeforces | Code | matrixstudio/codeforces-python-submissions | 2143 |
| Open-ICPC | Code | crawl/icpc | 1990 |
| CAMEL Physics | Science | camel-ai/physics | 271 |
| CAMEL Chemistry | Science | camel-ai/chemistry | 168 |
| CAMEL Biology | Science | camel-ai/biology | 5 |
| Total | | | 341795 |
**Table 2:** Summary of final datasets after refinement in *II-Thought*.
<img src="curated_plot.png" width="700">
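As a quick consistency check on Table 2, the per-domain totals can be recomputed from the sample counts. The figures below are copied from the table; the variable names are ours.

```python
from collections import defaultdict

# (dataset, domain, samples) rows copied from Table 2.
ROWS = [
    ("NuminaMath-1.5", "Math", 123442),
    ("Real World SWE", "Code", 69176),
    ("Mix-Math", "Math", 53532),
    ("medical-o1-verifiable-problem", "Medical", 38986),
    ("DeepScaler", "Math", 12573),
    ("OpenTextBook", "Science", 10593),
    ("GeneralThought-Feb25", "Reasoning", 9075),
    ("Code Contest", "Code", 8937),
    ("Apps & Taco", "Code", 7450),
    ("riddle_sense", "Riddle", 3454),
    ("Python Codeforces", "Code", 2143),
    ("Open-ICPC", "Code", 1990),
    ("CAMEL Physics", "Science", 271),
    ("CAMEL Chemistry", "Science", 168),
    ("CAMEL Biology", "Science", 5),
]

domain_totals = defaultdict(int)
for _dataset, domain, samples in ROWS:
    domain_totals[domain] += samples

grand_total = sum(domain_totals.values())

if __name__ == "__main__":
    for domain, count in sorted(domain_totals.items(), key=lambda kv: -kv[1]):
        print(f"{domain:<10} {count:>7} {count / grand_total:6.1%}")
    print(f"{'Total':<10} {grand_total:>7}")
```

The rows sum to the 341795 reported in the table, with mathematics making up a little over half of all samples and code roughly a quarter.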
## T-SNE Statistics
[Figures: t-SNE plots; images not available]
## Citation
```bib
@misc{2025iithought,
title={II-Thought : A Large-Scale, High-Quality Reasoning Dataset},
author={Intelligent Internet},
year={2025},
}
```
Provider: maas
Created: 2025-03-30



