OLMoE-mix-0924
Source: ModelScope community. Updated 2026-05-16; indexed 2024-09-14.
Download link: https://modelscope.cn/datasets/LLM-Research/OLMoE-mix-0924
Description:
# OLMoE Mix (September 2024)
## Dataset Description
- **Repository:** https://github.com/allenai/OLMoE
- **Paper:** [OLMoE: Open Mixture-of-Experts Language Models](https://arxiv.org/abs/2409.02060)
<img alt="OLMoE Mix Logo." src="olmoe-mix.png" width="250px">
The following data mix was used to train OLMoE-1B-7B, a Mixture-of-Experts LLM with 1B active and 7B total parameters released in September 2024.
The base version of OLMoE-1B-7B can be found on [this page](https://huggingface.co/allenai/OLMoE-1B-7B-0924), the SFT version is available [here](https://huggingface.co/allenai/OLMoE-1B-7B-0924-SFT), and a version combining SFT and DPO is available at [this link](https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct).
## Statistics
| Subset | Tokens | Words | Bytes | Docs |
|--------------------------------------------------------------|:----------:|:----------:|:----------:|:----------:|
| [DCLM Baseline 1.0](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) | 3.86 T | 3.38 T | 16.7 T | 2.95 B |
| [Starcoder](https://huggingface.co/datasets/bigcode/starcoderdata) | 101 B | 63.9 B | 325 B | 78.7 M |
| [peS2o](https://huggingface.co/datasets/allenai/peS2o)<br>([Dolma](https://huggingface.co/datasets/allenai/dolma)) | 57.2 B | 51.3 B | 268 B | 38.8 M |
| Arxiv<br>([RedPajama v1](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) <br>via [Proof Pile II](https://huggingface.co/datasets/EleutherAI/proof-pile-2)) | 21.1 B | 23.5 B | 88.8 B | 1.55 M |
| OpenWebMath<br>([Proof Pile II](https://huggingface.co/datasets/EleutherAI/proof-pile-2)) | 12.7 B | 10.2 B | 42.4 B | 2.91 M |
| Algebraic Stack<br>([Proof Pile II](https://huggingface.co/datasets/EleutherAI/proof-pile-2)) | 12.6 B | 9.6 B | 39.3 B | 2.83 M |
| En Wikipedia + <br>Wikibooks<br>([Dolma](https://huggingface.co/datasets/allenai/dolma)) | 3.69 B | 3.16 B | 16.2 B | 6.17 M |
| **Total** | **4.07 T** | **3.53 T** | **17.4 T** | **3.08 B** |
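As a quick sanity check on the table above, the following sketch sums the per-subset token counts and computes each subset's share of the mix. The counts are copied from the table as rounded values in billions, so the result only agrees with the reported total up to rounding; the variable names are illustrative.

```python
# Token counts per subset, in billions, copied from the table above (rounded values).
token_counts_billions = {
    "DCLM Baseline 1.0": 3860.0,
    "Starcoder": 101.0,
    "peS2o": 57.2,
    "Arxiv": 21.1,
    "OpenWebMath": 12.7,
    "Algebraic Stack": 12.6,
    "En Wikipedia + Wikibooks": 3.69,
}

# Sum of all subsets, in billions of tokens.
total_billions = sum(token_counts_billions.values())

# Fraction of the whole mix contributed by each subset.
shares = {name: count / total_billions for name, count in token_counts_billions.items()}

for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<26} {share:8.3%}")
print(f"Total: {total_billions / 1000:.2f} T tokens")
```

By this estimate, DCLM Baseline 1.0 accounts for roughly 95% of the tokens, and the remaining subsets together make up about 5%.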
## Preprocessing
All subsets were pre-processed to remove documents with a *sequence* of 32 or more repeated *ngrams*.
- an *ngram* is a span of 1 to 13 tokens, inclusive;
- *tokens* are obtained using the model tokenizer;
- a *sequence* is a contiguous span of repeated ngrams.
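The repetition rule above can be sketched in Python as follows. This is a minimal illustration and not the authors' pipeline: it assumes the document has already been tokenized into a list of token IDs, and it reads "a sequence of repeated ngrams" as the same ngram occurring back to back. The function name and its parameters are hypothetical.

```python
from typing import Sequence


def has_repeated_ngram_run(
    tokens: Sequence[int],
    max_ngram: int = 13,
    min_repeats: int = 32,
) -> bool:
    """Return True if some ngram of 1..max_ngram tokens occurs at least
    min_repeats times back to back (assumed reading of the rule)."""
    n_tokens = len(tokens)
    for n in range(1, max_ngram + 1):
        # A run of min_repeats copies of an n-token ngram needs n * min_repeats tokens.
        if n * min_repeats > n_tokens:
            break
        # Count consecutive positions where the token equals the one n steps earlier.
        # k back-to-back copies of an ngram of length n give (k - 1) * n such matches.
        run = 0
        for i in range(n, n_tokens):
            if tokens[i] == tokens[i - n]:
                run += 1
                if run >= n * (min_repeats - 1):
                    return True
            else:
                run = 0
    return False


# Hypothetical token IDs: a 2-token ngram repeated 32 times is flagged.
print(has_repeated_ngram_run([7, 8] * 32))
# A document without repetition is kept.
print(has_repeated_ngram_run(list(range(200))))
```

The periodicity check runs in time proportional to `max_ngram` times the document length, which avoids comparing every candidate ngram against every position.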
In addition to the above, the Starcoder dataset was further processed by removing any document that meets any of the following rules:
- the document is from a repository with fewer than 2 stars on GitHub;
- the most frequent word in the document constitutes over 30% of the document;
- the two most frequent words in the document constitute over 50% of the document.
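The three Starcoder rules above could be expressed roughly as below. This is a sketch under stated assumptions: words are obtained by whitespace splitting (the card does not say how words are defined), the star count is supplied as metadata by the caller, and empty documents are left to the star rule alone. The function name is hypothetical.

```python
from collections import Counter


def should_drop_code_document(text: str, repo_stars: int) -> bool:
    """Return True if the document matches any of the three removal rules."""
    # Rule 1: repository has fewer than 2 GitHub stars.
    if repo_stars < 2:
        return True

    # Assumption: a "word" is a whitespace-separated token.
    words = text.split()
    total = len(words)
    if total == 0:
        return False  # assumption: frequency rules do not apply to empty documents

    top_two = Counter(words).most_common(2)
    top1 = top_two[0][1]
    top2 = top1 + (top_two[1][1] if len(top_two) > 1 else 0)

    # Rule 2: the most frequent word is over 30% of the document.
    # Rule 3: the two most frequent words are over 50% of the document.
    # Integer arithmetic avoids floating-point edge cases at the thresholds.
    return top1 * 100 > 30 * total or top2 * 100 > 50 * total


# Hypothetical inputs: a varied document from a starred repository is kept.
print(should_drop_code_document("a b c d e f g h i j", repo_stars=5))
# A document dominated by one word is dropped.
print(should_drop_code_document("x x x x a b c d e f", repo_stars=5))
```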
## Licensing Information
This mix is licensed under the [Open Data Commons Attribution License (ODC-By) v1.0](https://opendatacommons.org/licenses/by/1-0/). By using this dataset, you are also bound by the licenses and Terms of Service of the underlying datasets, which you can access by clicking on the links in the table above.
## Citation
```bibtex
@misc{muennighoff2024olmoeopenmixtureofexpertslanguage,
title={OLMoE: Open Mixture-of-Experts Language Models},
author={Niklas Muennighoff and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Jacob Morrison and Sewon Min and Weijia Shi and Pete Walsh and Oyvind Tafjord and Nathan Lambert and Yuling Gu and Shane Arora and Akshita Bhagia and Dustin Schwenk and David Wadden and Alexander Wettig and Binyuan Hui and Tim Dettmers and Douwe Kiela and Ali Farhadi and Noah A. Smith and Pang Wei Koh and Amanpreet Singh and Hannaneh Hajishirzi},
year={2024},
eprint={2409.02060},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.02060},
}
```
Provider:
maas
Created:
2024-09-04



