dolma3_mix-6T-1025
Source: ModelScope community. Updated 2026-01-08; indexed 2025-11-29.
Download link:
https://modelscope.cn/datasets/allenai/dolma3_mix-6T-1025
Resource description:
# **⚠️ NOTE: Data Coming *very* Soon! ⚠️**
---
<img src="dolma-mix.png" alt="logo for the mix for Dolma 3" width=300>
# Dolma 3 Mix (6T)
The Dolma 3 Mix (6T) is the collection of data used during the pretraining stage to train the Olmo-3-1025-7B model. This dataset is made up of ~6 trillion tokens from a diverse mix of web content, academic publications, code, and more. The majority of this dataset comes from Common Crawl.
For more information on Dolma, please see our original release [here](https://huggingface.co/datasets/allenai/dolma).
## Dataset Sources
### Source Sizes
This dataset contains the full mix of documents used to train Olmo 3 7B.
| Source | Doc Type | Tokens | Bytes (uncompressed) | Documents | License |
|----------------------|------------------|--------|----------------------|-----------|---------|
| common_crawl | web pages | 4.51T | 18.0TB | 3.15B | ODC-BY |
| olmocr_science_pdfs | academic papers | 805B | 3.22TB | 83.8M | ODC-BY |
| stack_edu | code | 409B | 1.64TB | 525.8M | ODC-BY |
| finemath-3plus | mathematics | 151B | 607GB | 95.5M | ODC-BY |
| rpj-proofpile-arxiv | research papers | 50.9B | 203GB | 9.10M | ODC-BY |
| dolma1_7-wiki-en | encyclopedic | 2.51B | 10.0GB | 4.24M | ODC-BY |
| **Total** | | **5.93T** | **23.7TB** | **3.87B** | ODC-BY |
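
The columns of this table can be combined into rough per-source density figures. The following is a minimal sketch that uses only the rounded values from the table, so the results are approximate. The `SOURCES` dict and the `density` helper are illustrative names, and TB/GB are assumed to be decimal units (10^12 and 10^9 bytes), which the card does not state.

```python
# Rounded figures copied from the Source Sizes table: (tokens, bytes, documents).
# Assumption: TB = 1e12 bytes and GB = 1e9 bytes (not stated in the card).
SOURCES = {
    "common_crawl": (4.51e12, 18.0e12, 3.15e9),
    "olmocr_science_pdfs": (805e9, 3.22e12, 83.8e6),
    "stack_edu": (409e9, 1.64e12, 525.8e6),
    "finemath-3plus": (151e9, 607e9, 95.5e6),
    "rpj-proofpile-arxiv": (50.9e9, 203e9, 9.10e6),
    "dolma1_7-wiki-en": (2.51e9, 10.0e9, 4.24e6),
}


def density(tokens, n_bytes, docs):
    """Return (bytes per token, tokens per document) for one source."""
    return n_bytes / tokens, tokens / docs


for name, row in SOURCES.items():
    bytes_per_token, tokens_per_doc = density(*row)
    print(f"{name:22s} {bytes_per_token:5.2f} bytes/token "
          f"{tokens_per_doc:10,.0f} tokens/doc")
```

With these rounded inputs, every source lands close to 4 bytes per token, while tokens per document varies widely between sources (academic PDFs are far longer on average than wiki or code documents).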
### Mix Compositions
| Source | Source % (6T) | Mix % (6T) |
|----------------------|-------|-------|
| common_crawl | 76.07% | 76.07% |
| olmocr_science_pdfs | 13.57% | 13.57% |
| stack_edu | 6.89% | 6.89% |
| finemath-3plus | 2.56% | 2.56% |
| rpj-proofpile-arxiv | 0.86% | 0.86% |
| dolma1_7-wiki-en | 0.04% | 0.04% |
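
The percentages above can be approximately reproduced from the token counts in the Source Sizes table. This is a minimal sketch under the assumption that each share is simply the source's tokens divided by the total; because the inputs are the rounded table values, the results can differ from the published figures in the second decimal place. The names `TOKENS`, `PUBLISHED`, and `shares` are illustrative.

```python
# Rounded token counts copied from the Source Sizes table.
TOKENS = {
    "common_crawl": 4.51e12,
    "olmocr_science_pdfs": 805e9,
    "stack_edu": 409e9,
    "finemath-3plus": 151e9,
    "rpj-proofpile-arxiv": 50.9e9,
    "dolma1_7-wiki-en": 2.51e9,
}

# Percentages as published in the Mix Compositions table.
PUBLISHED = {
    "common_crawl": 76.07,
    "olmocr_science_pdfs": 13.57,
    "stack_edu": 6.89,
    "finemath-3plus": 2.56,
    "rpj-proofpile-arxiv": 0.86,
    "dolma1_7-wiki-en": 0.04,
}

total = sum(TOKENS.values())
# Share of each source as a percentage of all tokens in the mix.
shares = {name: 100 * count / total for name, count in TOKENS.items()}

for name, pct in shares.items():
    print(f"{name:22s} computed {pct:6.2f}%   published {PUBLISHED[name]:6.2f}%")
```

The computed shares agree with the published ones to within a few hundredths of a percentage point, which is consistent with the published values having been derived from unrounded token counts.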
## Licensing Information
Dolma 3 Mix is licensed under the Open Data Commons Attribution License v1.0 (ODC-By). It is intended for research and educational use. For more information, please see our [Responsible Use Guidelines](https://allenai.org/responsible-use).
## Citation
A technical manuscript is forthcoming!
Provider:
maas
Created:
2025-11-22



