dolma3_longmino_pool
Source: ModelScope community. Updated 2026-01-07; indexed 2025-11-29.
Download link:
https://modelscope.cn/datasets/allenai/dolma3_longmino_pool
Description:
⚠️ **IMPORTANT NOTICE** ⚠️
This is the Dolma 3 Longmino **pool**; it hasn't been mixed.
If you are interested in *the data used* to train:
- [Olmo 3 7B](https://huggingface.co/allenai/Olmo-3-1025-7B): [**allenai/dolma3_longmino_mix-50B-1025**](https://huggingface.co/datasets/allenai/dolma3_longmino_mix-50B-1025)
- [Olmo 3 32B](https://huggingface.co/allenai/Olmo-3-1025-32B): [**allenai/dolma3_longmino_mix-100B-1125**](https://huggingface.co/datasets/allenai/dolma3_longmino_mix-100B-1125)
---
<img alt="Logo for Longmino Pool" src="longmino-pool.png" width="326px" style="margin-left:auto; margin-right:auto; display:block;">
# Dolma 3 Longmino Pool (639B)
Dolma 3 Longmino Pool is the full pool of documents considered for stage 3 (long-context extension) training of Olmo 3 7B.
## Dataset Sources
| Source | Type | Tokens | Docs |
|--------|------|--------|------|
| LC-s2pdf-REX 32k-64k | Synth PDFs | 24.1B | 492K |
| LC-s2pdf-CWE 32k-64k | Synth PDFs | 8.77B | 189K |
| LC-s2pdf 32k-64k | PDFs | 106B | 2.30M |
| LC-s2pdf 8k-32k (8-16k) | PDFs | 144B | 12.7M |
| LC-s2pdf 8k-32k (16-32k) | PDFs | 115B | 5.06M |
| LC-s2pdf 64k-128k | PDFs | 96.0B | 1.05M |
| LC-s2pdf 128k-256k | PDFs | 60.8B | 342K |
| LC-s2pdf 256k-512k | PDFs | 35.1B | 97.1K |
| LC-s2pdf 512k-1M | PDFs | 21.5B | 30.2K |
| LC-s2pdf 1M+ | PDFs | 26.9B | 12.2K |
| **Total** | | **639B** | **22.3M** |
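
As a quick sanity check on the table above, the following sketch recomputes the totals and the mean document length per source. It uses only the rounded figures printed in the table, so the results are approximate; the variable and function names are illustrative and are not part of any Dolma tooling.

```python
# Rounded figures copied from the table above:
# (source, tokens in billions, documents in millions).
SOURCES = [
    ("LC-s2pdf-REX 32k-64k", 24.1, 0.492),
    ("LC-s2pdf-CWE 32k-64k", 8.77, 0.189),
    ("LC-s2pdf 32k-64k", 106.0, 2.30),
    ("LC-s2pdf 8k-32k (8-16k)", 144.0, 12.7),
    ("LC-s2pdf 8k-32k (16-32k)", 115.0, 5.06),
    ("LC-s2pdf 64k-128k", 96.0, 1.05),
    ("LC-s2pdf 128k-256k", 60.8, 0.342),
    ("LC-s2pdf 256k-512k", 35.1, 0.0971),
    ("LC-s2pdf 512k-1M", 21.5, 0.0302),
    ("LC-s2pdf 1M+", 26.9, 0.0122),
]


def mean_tokens_per_doc(tokens_b: float, docs_m: float) -> float:
    """Average tokens per document, from billions of tokens and millions of docs."""
    return (tokens_b * 1e9) / (docs_m * 1e6)


# Sum the rounded per-source figures.
total_tokens_b = sum(tokens for _, tokens, _ in SOURCES)
total_docs_m = sum(docs for _, _, docs in SOURCES)

for name, tokens_b, docs_m in SOURCES:
    print(f"{name:26s} {mean_tokens_per_doc(tokens_b, docs_m):>12,.0f} tokens/doc")

print(f"Total tokens: {total_tokens_b:.2f}B")
print(f"Total docs:   {total_docs_m:.2f}M")
```

Because each row is rounded to three significant figures, the summed token count lands slightly under the reported 639B total, while the document count rounds to the reported 22.3M. The per-source means also fall inside each source's stated length bucket, which is a useful consistency check on the table.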
## Licensing Information
Dolma 3 Longmino is licensed under the Open Data Commons Attribution License v1.0 (ODC-By). It is intended for research and educational use. For more information, please see our [Responsible Use Guidelines](https://allenai.org/responsible-use).
## Citation
A technical manuscript is forthcoming!
Provider:
maas
Created:
2025-11-22



