salmankhanpm/dolma3_dolmino_mix-100B-1025
Hugging Face · Updated 2026-01-28 · Indexed 2026-03-29
Download link: https://hf-mirror.com/datasets/salmankhanpm/dolma3_dolmino_mix-100B-1025
---
license: odc-by
task_categories:
- text-generation
pretty_name: dolma3-dolmino 100B Mix (October 2025)
language:
- en
configs:
- config_name: default
  data_files:
  - split: train
    path: data/**/*
- config_name: stem_heavy_crawl
  data_files:
  - split: train
    path: data/stem_heavy_crawl/**/*
- config_name: stack_edu_fim
  data_files:
  - split: train
    path: data/stack_edu-fim-*/**/*
- config_name: cranecode
  data_files:
  - split: train
    path: data/cranecode/**/*
- config_name: cranemath
  data_files:
  - split: train
    path: data/cranemath/**/*
- config_name: megamatt
  data_files:
  - split: train
    path: data/megamatt/**/*
- config_name: dolmino1_math
  data_files:
  - split: train
    path: data/dolmino1_math/**/*
- config_name: omr_rewrite_fullthoughts
  data_files:
  - split: train
    path: data/omr-rewrite-fullthoughts/**/*
- config_name: tinyMATH_mind
  data_files:
  - split: train
    path: data/tinyMATH-mind/**/*
- config_name: tinyMATH_pot
  data_files:
  - split: train
    path: data/tinyMATH-pot/**/*
- config_name: reddit_high
  data_files:
  - split: train
    path: data/reddit_to_flashcards-high_relevance/**/*
- config_name: reddit_low
  data_files:
  - split: train
    path: data/reddit_to_flashcards-low_relevance/**/*
- config_name: wiki_to_rcqa
  data_files:
  - split: train
    path: data/wiki_to_rcqa/**/*
- config_name: nemotron_synth_qa
  data_files:
  - split: train
    path: data/nemotron-synth_qa/**/*
- config_name: tulu_3_sft
  data_files:
  - split: train
    path: data/tulu_3_sft/**/*
- config_name: dolmino_1_flan
  data_files:
  - split: train
    path: data/dolmino_1-flan/**/*
- config_name: r1_reasoning_traces
  data_files:
  - split: train
    path: data/r1_reasoning_traces/**/*
- config_name: qwq_reasoning_traces
  data_files:
  - split: train
    path: data/qwq_reasoning_traces/**/*
- config_name: gemini_reasoning_traces
  data_files:
  - split: train
    path: data/gemini_reasoning_traces/**/*
- config_name: llamanemotron_reasoning_traces
  data_files:
  - split: train
    path: data/llamanemotron_reasoning_traces/**/*
- config_name: openthoughts2_reasoning_traces
  data_files:
  - split: train
    path: data/openthoughts2_reasoning_traces/**/*
- config_name: verifiable_gpt41
  data_files:
  - split: train
    path: data/verifiable-gpt41/**/*
- config_name: verifiable_o4mini
  data_files:
  - split: train
    path: data/verifiable-o4mini/**/*
- config_name: math_meta_reasoning
  data_files:
  - split: train
    path: data/math_meta_reasoning/**/*
- config_name: code_meta_reasoning
  data_files:
  - split: train
    path: data/code_meta_reasoning/**/*
- config_name: olmocr_science_pdfs
  data_files:
  - split: train
    path: data/olmocr_science_pdfs-high_quality-*/**/*
- config_name: common_crawl_hq
  data_files:
  - split: train
    path: data/common_crawl-high_quality-*/**/*
---
<img alt="Logo for Dolmino Mix" src="dolmino-mix.png" width="289px" style="margin-left:auto; margin-right:auto; display:block">
# Dolma 3 Dolmino Mix (100B)
The Dolma 3 Dolmino Mix (100B) is the mixture of high-quality data used for the second stage of training of the Olmo 3 7B model.
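
Each source in the mix is exposed as a named config in the card metadata, so a single subset can likely be read without downloading everything. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository ID from this page resolves on the Hub. The helper name `stream_config` and the choice of `cranemath` are illustrative, and the record fields are not documented here, so the example only prints the keys of each record.

```python
from typing import Any, Dict, Iterator

# Repository ID as shown at the top of this page.
REPO_ID = "salmankhanpm/dolma3_dolmino_mix-100B-1025"


def stream_config(config_name: str = "cranemath", limit: int = 3) -> Iterator[Dict[str, Any]]:
    """Yield up to `limit` records from one config without a full download.

    `stream_config` is a hypothetical helper, not part of the dataset.
    """
    # Imported lazily so that defining the helper does not require `datasets`.
    from datasets import load_dataset

    # streaming=True iterates over remote files instead of materializing them.
    ds = load_dataset(REPO_ID, name=config_name, split="train", streaming=True)
    for i, record in enumerate(ds):
        if i >= limit:
            break
        yield record


if __name__ == "__main__":
    for record in stream_config("cranemath", limit=3):
        # Field names are not listed in the card, so only inspect the keys.
        print(sorted(record.keys()))
```

Streaming is used here because the full mix is roughly 100B tokens; passing `name="default"` would select every file under `data/`.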
### Dataset Sources
| Source | Category | Tokens | Documents |
|--------|----------|--------|-----------|
| TinyMATH Mind | Math (synth) | 898M (0.9%) | 1.52M |
| TinyMATH PoT | Math (synth) | 241M (0.24%) | 758K |
| CraneMath | Math (synth) | 5.62B (5.63%) | 7.24M |
| MegaMatt | Math (synth) | 1.73B (1.73%) | 3.23M |
| Dolmino Math | Math (synth) | 10.7B (10.7%) | 22.3M |
| StackEdu (FIM) | Code | 10.0B (10.0%) | 16.2M |
| CraneCode | Python (synth) | 10.0B (10.0%) | 11.7M |
| Reddit To Flashcards | QA (synth) | 5.90B (5.9%) | 101M |
| Wiki To RCQA | QA (synth) | 3.0B (3.0%) | 16.3M |
| Nemotron Synth QA | QA (synth) | 5.0B (5.0%) | 10.6M |
| Math Meta-Reasoning | Thinking (synth) | 381M (0.38%) | 401K |
| Code Meta-Reasoning | Thinking (synth) | 459M (0.46%) | 398K |
| Program-Verifiable | Thinking (synth) | 159M (0.16%) | 158K |
| OMR Rewrite FullThoughts | Thinking (synth) | 850M (0.85%) | 394K |
| QWQ Reasoning Traces | Thinking (synth) | 1.87B (1.87%) | 401K |
| General Reasoning Mix | Thinking (synth) | 1.87B (1.87%) | 732K |
| Gemini Reasoning Traces | Thinking (synth) | 246M (0.25%) | 85.1K |
| Llama Nemotron Reasoning Traces | Thinking (synth) | 1.25B (1.25%) | 368K |
| OpenThoughts2 Reasoning Traces | Thinking (synth) | 1.25B (1.25%) | 402K |
| Tulu 3 SFT | Instruction (synth) | 1.1B (1.1%) | 1.45M |
| Dolmino 1 Flan | Instruction (synth) | 5.0B (5.0%) | 14.8M |
| OLMOCR Science PDFs (High Q.) | PDFs | 4.99B (5.0%) | 1.20M |
| STEM-Heavy Crawl | Web pages | 4.99B (5.0%) | 5.53M |
| Common Crawl (High Q.) | Web pages | 22.4B (22.5%) | 18.3M |
| **Total** | | **99.95B (100%)** | **236M** |
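
The per-category share of the mix can be recomputed from the table. The sketch below copies the rounded token counts from the table (in millions of tokens) and groups them by the Category column; the variable names are illustrative. Because the inputs are rounded, the recomputed total lands slightly below the stated 99.95B, and individual shares can differ from the listed percentages in the last digit.

```python
from collections import defaultdict

# (source, category, tokens in millions), copied from the table above.
SOURCES = [
    ("TinyMATH Mind", "Math (synth)", 898),
    ("TinyMATH PoT", "Math (synth)", 241),
    ("CraneMath", "Math (synth)", 5620),
    ("MegaMatt", "Math (synth)", 1730),
    ("Dolmino Math", "Math (synth)", 10700),
    ("StackEdu (FIM)", "Code", 10000),
    ("CraneCode", "Python (synth)", 10000),
    ("Reddit To Flashcards", "QA (synth)", 5900),
    ("Wiki To RCQA", "QA (synth)", 3000),
    ("Nemotron Synth QA", "QA (synth)", 5000),
    ("Math Meta-Reasoning", "Thinking (synth)", 381),
    ("Code Meta-Reasoning", "Thinking (synth)", 459),
    ("Program-Verifiable", "Thinking (synth)", 159),
    ("OMR Rewrite FullThoughts", "Thinking (synth)", 850),
    ("QWQ Reasoning Traces", "Thinking (synth)", 1870),
    ("General Reasoning Mix", "Thinking (synth)", 1870),
    ("Gemini Reasoning Traces", "Thinking (synth)", 246),
    ("Llama Nemotron Reasoning Traces", "Thinking (synth)", 1250),
    ("OpenThoughts2 Reasoning Traces", "Thinking (synth)", 1250),
    ("Tulu 3 SFT", "Instruction (synth)", 1100),
    ("Dolmino 1 Flan", "Instruction (synth)", 5000),
    ("OLMOCR Science PDFs (High Q.)", "PDFs", 4990),
    ("STEM-Heavy Crawl", "Web pages", 4990),
    ("Common Crawl (High Q.)", "Web pages", 22400),
]

# Sum the rounded token counts per category.
by_category = defaultdict(int)
for _source, category, tokens_m in SOURCES:
    by_category[category] += tokens_m

total_m = sum(by_category.values())

if __name__ == "__main__":
    for category, tokens_m in sorted(by_category.items(), key=lambda kv: -kv[1]):
        share = 100.0 * tokens_m / total_m
        print(f"{category:<20} {tokens_m / 1000:>7.2f}B  {share:5.1f}%")
    print(f"{'Total':<20} {total_m / 1000:>7.2f}B")
```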
---
#### Mix Compositions
| Source | 10B: Source % | 10B: Mix % | 100B: Source % | 100B: Mix % |
|--------|-----|-----|------|-----|
| STEM-Heavy Crawl | - | - | 5.0% | 5.0% |
| StackEdu (FIM) | - | - | 10.0% | 10.0% |
| CraneCode | - | - | 10.0% | 10.0% |
| CraneMath | - | - | 5.63% | 5.63% |
| MegaMatt | - | - | 1.73% | 1.73% |
| Dolmino Math | - | - | 10.7% | 10.7% |
| OMR Rewrite FullThoughts | - | - | 0.85% | 0.85% |
| TinyMATH Mind | - | - | 0.9% | 0.9% |
| TinyMATH PoT | - | - | 0.24% | 0.24% |
| Reddit To Flashcards | - | - | 5.9% | 5.9% |
| Wiki To RCQA | - | - | 3.0% | 3.0% |
| Nemotron Synth QA | - | - | 5.0% | 5.0% |
| Tulu 3 SFT | - | - | 1.1% | 1.1% |
| Dolmino 1 Flan | - | - | 5.0% | 5.0% |
| QWQ Reasoning Traces | - | - | 1.87% | 1.87% |
| Gemini Reasoning Traces | - | - | 0.25% | 0.25% |
| Llama Nemotron Reasoning Traces | - | - | 1.25% | 1.25% |
| OpenThoughts2 Reasoning Traces | - | - | 1.25% | 1.25% |
| Program-Verifiable | - | - | 0.16% | 0.16% |
| Math Meta-Reasoning | - | - | 0.38% | 0.38% |
| Code Meta-Reasoning | - | - | 0.46% | 0.46% |
| General Reasoning Mix | - | - | 1.87% | 1.87% |
| OLMOCR Science PDFs (High Q.) | - | - | 5.0% | 5.0% |
| Common Crawl (High Q.) | - | - | 22.5% | 22.5% |
## Licensing Information
Dolma 3 Dolmino is licensed under the Open Data Commons Attribution License v1.0 (ODC-By). It is intended for research and educational use. For more information, please see our [Responsible Use Guidelines](https://allenai.org/responsible-use).
## Citation
```
@misc{olmo2025olmo3,
title={Olmo 3},
author={Team Olmo and Allyson Ettinger and Amanda Bertsch and Bailey Kuehl and David Graham and David Heineman and Dirk Groeneveld and Faeze Brahman and Finbarr Timbers and Hamish Ivison and Jacob Morrison and Jake Poznanski and Kyle Lo and Luca Soldaini and Matt Jordan and Mayee Chen and Michael Noukhovitch and Nathan Lambert and Pete Walsh and Pradeep Dasigi and Robert Berry and Saumya Malik and Saurabh Shah and Scott Geng and Shane Arora and Shashank Gupta and Taira Anderson and Teng Xiao and Tyler Murray and Tyler Romero and Victoria Graf and Akari Asai and Akshita Bhagia and Alexander Wettig and Alisa Liu and Aman Rangapur and Chloe Anastasiades and Costa Huang and Dustin Schwenk and Harsh Trivedi and Ian Magnusson and Jaron Lochner and Jiacheng Liu and Lester James V. Miranda and Maarten Sap and Malia Morgan and Michael Schmitz and Michal Guerquin and Michael Wilson and Regan Huff and Ronan Le Bras and Rui Xin and Rulin Shao and Sam Skjonsberg and Shannon Zejiang Shen and Shuyue Stella Li and Tucker Wilde and Valentina Pyatkin and Will Merrill and Yapei Chang and Yuling Gu and Zhiyuan Zeng and Ashish Sabharwal and Luke Zettlemoyer and Pang Wei Koh and Ali Farhadi and Noah A. Smith and Hannaneh Hajishirzi},
year={2025},
eprint={2512.13961},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.13961},
}
```
Provider: salmankhanpm



