
allenai/dolma3_dolmino_mix-100B-1025

Hugging Face · Updated 2026-01-05 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/allenai/dolma3_dolmino_mix-100B-1025
Resource description:
---
license: odc-by
task_categories:
- text-generation
pretty_name: dolma3-dolmino 100B Mix (October 2025)
language:
- en
configs:
- config_name: default
  data_files:
  - split: train
    path: data/**/*
- config_name: stem_heavy_crawl
  data_files:
  - split: train
    path: data/stem_heavy_crawl/**/*
- config_name: stack_edu_fim
  data_files:
  - split: train
    path: data/stack_edu-fim-*/**/*
- config_name: cranecode
  data_files:
  - split: train
    path: data/cranecode/**/*
- config_name: cranemath
  data_files:
  - split: train
    path: data/cranemath/**/*
- config_name: megamatt
  data_files:
  - split: train
    path: data/megamatt/**/*
- config_name: dolmino1_math
  data_files:
  - split: train
    path: data/dolmino1_math/**/*
- config_name: omr_rewrite_fullthoughts
  data_files:
  - split: train
    path: data/omr-rewrite-fullthoughts/**/*
- config_name: tinyMATH_mind
  data_files:
  - split: train
    path: data/tinyMATH-mind/**/*
- config_name: tinyMATH_pot
  data_files:
  - split: train
    path: data/tinyMATH-pot/**/*
- config_name: reddit_high
  data_files:
  - split: train
    path: data/reddit_to_flashcards-high_relevance/**/*
- config_name: reddit_low
  data_files:
  - split: train
    path: data/reddit_to_flashcards-low_relevance/**/*
- config_name: wiki_to_rcqa
  data_files:
  - split: train
    path: data/wiki_to_rcqa/**/*
- config_name: nemotron_synth_qa
  data_files:
  - split: train
    path: data/nemotron-synth_qa/**/*
- config_name: tulu_3_sft
  data_files:
  - split: train
    path: data/tulu_3_sft/**/*
- config_name: dolmino_1_flan
  data_files:
  - split: train
    path: data/dolmino_1-flan/**/*
- config_name: r1_reasoning_traces
  data_files:
  - split: train
    path: data/r1_reasoning_traces/**/*
- config_name: qwq_reasoning_traces
  data_files:
  - split: train
    path: data/qwq_reasoning_traces/**/*
- config_name: gemini_reasoning_traces
  data_files:
  - split: train
    path: data/gemini_reasoning_traces/**/*
- config_name: llamanemotron_reasoning_traces
  data_files:
  - split: train
    path: data/llamanemotron_reasoning_traces/**/*
- config_name: openthoughts2_reasoning_traces
  data_files:
  - split: train
    path: data/openthoughts2_reasoning_traces/**/*
- config_name: verifiable_gpt41
  data_files:
  - split: train
    path: data/verifiable-gpt41/**/*
- config_name: verifiable_o4mini
  data_files:
  - split: train
    path: data/verifiable-o4mini/**/*
- config_name: math_meta_reasoning
  data_files:
  - split: train
    path: data/math_meta_reasoning/**/*
- config_name: code_meta_reasoning
  data_files:
  - split: train
    path: data/code_meta_reasoning/**/*
- config_name: olmocr_science_pdfs
  data_files:
  - split: train
    path: data/olmocr_science_pdfs-high_quality-*/**/*
- config_name: common_crawl_hq
  data_files:
  - split: train
    path: data/common_crawl-high_quality-*/**/*
---

<img alt="Logo for Dolmino Mix" src="dolmino-mix.png" width="289px" style="margin-left:'auto' margin-right:'auto' display:'block'">

# Dolma 3 Dolmino Mix (100B)

The Dolma 3 Dolmino Mix (100B) is the mixture of high-quality data used for the second stage of training of the Olmo 3 7B model.

### Dataset Sources

| Source | Category | Tokens | Documents |
|--------|----------|--------|-----------|
| TinyMATH Mind | Math (synth) | 898M (0.9%) | 1.52M |
| TinyMATH PoT | Math (synth) | 241M (0.24%) | 758K |
| CraneMath | Math (synth) | 5.62B (5.63%) | 7.24M |
| MegaMatt | Math (synth) | 1.73B (1.73%) | 3.23M |
| Dolmino Math | Math (synth) | 10.7B (10.7%) | 22.3M |
| StackEdu (FIM) | Code | 10.0B (10.0%) | 16.2M |
| CraneCode | Python (synth) | 10.0B (10.0%) | 11.7M |
| Reddit To Flashcards | QA (synth) | 5.90B (5.9%) | 101M |
| Wiki To RCQA | QA (synth) | 3.0B (3.0%) | 16.3M |
| Nemotron Synth QA | QA (synth) | 5.0B (5.0%) | 10.6M |
| Math Meta-Reasoning | Thinking (synth) | 381M (0.38%) | 401K |
| Code Meta-Reasoning | Thinking (synth) | 459M (0.46%) | 398K |
| Program-Verifiable | Thinking (synth) | 159M (0.16%) | 158K |
| OMR Rewrite FullThoughts | Thinking (synth) | 850M (0.85%) | 394K |
| QWQ Reasoning Traces | Thinking (synth) | 1.87B (1.87%) | 401K |
| General Reasoning Mix | Thinking (synth) | 1.87B (1.87%) | 732K |
| Gemini Reasoning Traces | Thinking (synth) | 246M (0.25%) | 85.1K |
| Llama Nemotron Reasoning Traces | Thinking (synth) | 1.25B (1.25%) | 368K |
| OpenThoughts2 Reasoning Traces | Thinking (synth) | 1.25B (1.25%) | 402K |
| Tulu 3 SFT | Instruction (synth) | 1.1B (1.1%) | 1.45M |
| Dolmino 1 Flan | Instruction (synth) | 5.0B (5.0%) | 14.8M |
| OLMOCR Science PDFs (High Q.) | PDFs | 4.99B (5.0%) | 1.20M |
| STEM-Heavy Crawl | Web pages | 4.99B (5.0%) | 5.53M |
| Common Crawl (High Q.) | Web pages | 22.4B (22.5%) | 18.3M |
| **Total** | | **99.95B (100%)** | **236M** |

---

#### Mix Compositions

| Source | 10B | | 100B | |
|--------|-----|-----|------|-----|
| | Source % | Mix % | Source % | Mix % |
| STEM-Heavy Crawl | - | - | 5.0% | 5.0% |
| StackEdu (FIM) | - | - | 10.0% | 10.0% |
| CraneCode | - | - | 10.0% | 10.0% |
| CraneMath | - | - | 5.63% | 5.63% |
| MegaMatt | - | - | 1.73% | 1.73% |
| Dolmino Math | - | - | 10.7% | 10.7% |
| OMR Rewrite FullThoughts | - | - | 0.85% | 0.85% |
| TinyMATH Mind | - | - | 0.9% | 0.9% |
| TinyMATH PoT | - | - | 0.24% | 0.24% |
| Reddit To Flashcards | - | - | 5.9% | 5.9% |
| Wiki To RCQA | - | - | 3.0% | 3.0% |
| Nemotron Synth QA | - | - | 5.0% | 5.0% |
| Tulu 3 SFT | - | - | 1.1% | 1.1% |
| Dolmino 1 Flan | - | - | 5.0% | 5.0% |
| QWQ Reasoning Traces | - | - | 1.87% | 1.87% |
| Gemini Reasoning Traces | - | - | 0.25% | 0.25% |
| Llama Nemotron Reasoning Traces | - | - | 1.25% | 1.25% |
| OpenThoughts2 Reasoning Traces | - | - | 1.25% | 1.25% |
| Program-Verifiable | - | - | 0.16% | 0.16% |
| Math Meta-Reasoning | - | - | 0.38% | 0.38% |
| Code Meta-Reasoning | - | - | 0.46% | 0.46% |
| General Reasoning Mix | - | - | 1.87% | 1.87% |
| OLMOCR Science PDFs (High Q.) | - | - | 5.0% | 5.0% |
| Common Crawl (High Q.) | - | - | 22.5% | 22.5% |

## Licensing Information

Dolma 3 Dolmino is licensed under the Open Data Commons Attribution License v1.0 (ODC-By). It is intended for research and educational use. For more information, please see our [Responsible Use Guidelines](https://allenai.org/responsible-use).

## Citation

```
@misc{olmo2025olmo3,
  title={Olmo 3},
  author={Team Olmo and Allyson Ettinger and Amanda Bertsch and Bailey Kuehl and David Graham and David Heineman and Dirk Groeneveld and Faeze Brahman and Finbarr Timbers and Hamish Ivison and Jacob Morrison and Jake Poznanski and Kyle Lo and Luca Soldaini and Matt Jordan and Mayee Chen and Michael Noukhovitch and Nathan Lambert and Pete Walsh and Pradeep Dasigi and Robert Berry and Saumya Malik and Saurabh Shah and Scott Geng and Shane Arora and Shashank Gupta and Taira Anderson and Teng Xiao and Tyler Murray and Tyler Romero and Victoria Graf and Akari Asai and Akshita Bhagia and Alexander Wettig and Alisa Liu and Aman Rangapur and Chloe Anastasiades and Costa Huang and Dustin Schwenk and Harsh Trivedi and Ian Magnusson and Jaron Lochner and Jiacheng Liu and Lester James V. Miranda and Maarten Sap and Malia Morgan and Michael Schmitz and Michal Guerquin and Michael Wilson and Regan Huff and Ronan Le Bras and Rui Xin and Rulin Shao and Sam Skjonsberg and Shannon Zejiang Shen and Shuyue Stella Li and Tucker Wilde and Valentina Pyatkin and Will Merrill and Yapei Chang and Yuling Gu and Zhiyuan Zeng and Ashish Sabharwal and Luke Zettlemoyer and Pang Wei Koh and Ali Farhadi and Noah A. Smith and Hannaneh Hajishirzi},
  year={2025},
  eprint={2512.13961},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.13961},
}
```
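
### Example: streaming a single config

The `configs` metadata for this dataset maps each `config_name` to a subdirectory under `data/`, so one source can be read without fetching the whole mix. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable. The helper name `peek` and the default config are illustrative choices, and the record schema is not documented here, so the sketch makes no assumption about field names.

```python
REPO_ID = "allenai/dolma3_dolmino_mix-100B-1025"  # dataset ID of this mix


def peek(config_name: str = "cranemath", n: int = 3) -> list:
    """Stream the first n records of one config (hypothetical helper).

    Streaming avoids downloading the full mix before reading anything.
    """
    # Imported lazily so the module can be inspected without the dependency.
    from datasets import load_dataset

    ds = load_dataset(REPO_ID, name=config_name, split="train", streaming=True)
    # IterableDataset.take(n) yields at most n records as plain dicts.
    return list(ds.take(n))
```

Calling `peek("tinyMATH_pot")` would return up to three records from that source. Inspecting `record.keys()` on the result is a reasonable first step, since the fields may differ between sources.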
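
### Example: checking the mix shares

The percentages in the Dataset Sources table can be approximately recomputed from the token counts. The sketch below copies the rounded counts from that table (in billions of tokens) and derives each source's share. Because the inputs are rounded, the results are only expected to agree with the stated figures to within rounding. The function name `mix_shares` is an illustrative choice.

```python
# Token counts in billions, copied from the "Dataset Sources" table (rounded).
TOKENS_B = {
    "TinyMATH Mind": 0.898,
    "TinyMATH PoT": 0.241,
    "CraneMath": 5.62,
    "MegaMatt": 1.73,
    "Dolmino Math": 10.7,
    "StackEdu (FIM)": 10.0,
    "CraneCode": 10.0,
    "Reddit To Flashcards": 5.90,
    "Wiki To RCQA": 3.0,
    "Nemotron Synth QA": 5.0,
    "Math Meta-Reasoning": 0.381,
    "Code Meta-Reasoning": 0.459,
    "Program-Verifiable": 0.159,
    "OMR Rewrite FullThoughts": 0.850,
    "QWQ Reasoning Traces": 1.87,
    "General Reasoning Mix": 1.87,
    "Gemini Reasoning Traces": 0.246,
    "Llama Nemotron Reasoning Traces": 1.25,
    "OpenThoughts2 Reasoning Traces": 1.25,
    "Tulu 3 SFT": 1.1,
    "Dolmino 1 Flan": 5.0,
    "OLMOCR Science PDFs (High Q.)": 4.99,
    "STEM-Heavy Crawl": 4.99,
    "Common Crawl (High Q.)": 22.4,
}
STATED_TOTAL_B = 99.95  # total reported in the table


def mix_shares(tokens: dict) -> dict:
    """Return each source's share of the summed token count, in percent."""
    total = sum(tokens.values())
    return {name: 100.0 * count / total for name, count in tokens.items()}


total_b = sum(TOKENS_B.values())
shares = mix_shares(TOKENS_B)

if __name__ == "__main__":
    print(f"sum of rounded counts: {total_b:.2f}B (stated: {STATED_TOTAL_B}B)")
    for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:35s} {pct:5.2f}%")
```

The rounded counts sum to slightly less than the stated 99.95B total, which is consistent with the per-source figures having been rounded to three significant digits.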
Provider:
allenai