dolma3_mix-5.5T-1125
Source: ModelScope community. Updated 2026-01-06; indexed 2025-12-27.
Download link:
https://modelscope.cn/datasets/allenai/dolma3_mix-5.5T-1125
Resource description:
---
<img src="https://cdn-uploads.huggingface.co/production/uploads/65316953791d5a2611426c20/JopP0oxXQlhiB7YHQGZhY.png" width="300" alt="dolma-mix">
# Dolma 3 Mix (6T)
The Dolma 3 Mix (6T) is the collection of data used during the pretraining stage to train the Olmo-3-1125-32B model. This dataset is made up of ~6 trillion tokens from a diverse mix of web content, academic publications, code, and more. The majority of this dataset comes from Common Crawl.
For more information on Dolma, please see our original release [here](https://huggingface.co/datasets/allenai/dolma).
## Licensing Information
Dolma 3 Mix is licensed under the Open Data Commons Attribution License v1.0 (ODC-By). It is intended for research and educational use. For more information, please see our [Responsible Use Guidelines](https://allenai.org/responsible-use).
## Citation
A technical manuscript is forthcoming! Find the paper at: https://allenai.org/papers/olmo3
Provider:
maas
Created:
2025-11-25



