iliasslasri/tokenized-OLMoE-mix
Source: Hugging Face. Updated 2026-03-15; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/iliasslasri/tokenized-OLMoE-mix
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- moE
- olmoe
- pretraining
- allenai
pretty_name: Tokenized OLMoE Mix
size_categories:
- 1B<n<10B
---
# Dataset Card for Tokenized OLMoE Mix
## Dataset Summary
This dataset contains pre-tokenized training and evaluation data designed for training custom small-scale **OLMoE (Mixture-of-Experts)** models.
The data comes primarily from AI2's official Dolma 1 and C4 datasets and was curated for ablation studies and reproduction experiments related to the [OLMoE technical paper (arXiv:2409.02060)](https://arxiv.org/abs/2409.02060). The files are provided in `.npy` format and were pre-tokenized with the `allenai/gpt-neox-olmo-dolma-v1_5` tokenizer.
## Dataset Structure
The dataset currently totals **7.38 GB** and is split into three parts for easier handling:
* `part-0-00000.npy` (2.51 GB)
* `part-0-00001.npy` (4.29 GB)
* `part-0-00002.npy` (574 MB)
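As a rough sanity check on these sizes, the sketch below converts bytes to token counts and shows one way to open a part file. It rests on two assumptions that this card does not state: that each token ID is stored as a 2-byte unsigned integer (`uint16`), and that the files are flat, headerless arrays readable with `np.memmap`. The helper names `estimate_tokens` and `open_tokens` are hypothetical. If the files turn out to be standard NumPy archives with a header, `np.load(path, mmap_mode="r")` would be the call to try instead.

```python
import numpy as np

# File sizes as listed in this card (decimal GB/MB assumed).
PART_SIZES_BYTES = {
    "part-0-00000.npy": 2.51e9,
    "part-0-00001.npy": 4.29e9,
    "part-0-00002.npy": 574e6,
}

# ASSUMPTION: token IDs are stored as uint16 (2 bytes each).
# The card does not state the dtype; verify before training.
TOKEN_DTYPE = np.uint16


def estimate_tokens(size_bytes, dtype=TOKEN_DTYPE):
    """Estimate how many token IDs fit in `size_bytes` for the given dtype."""
    return int(size_bytes // np.dtype(dtype).itemsize)


def open_tokens(path, dtype=TOKEN_DTYPE):
    """Memory-map a token file without loading it into RAM.

    ASSUMPTION: the file is a flat, headerless array of token IDs.
    """
    return np.memmap(path, dtype=dtype, mode="r")


total_bytes = sum(PART_SIZES_BYTES.values())
total_tokens = estimate_tokens(total_bytes)
print(f"{total_bytes / 1e9:.2f} GB -> ~{total_tokens / 1e9:.2f}B tokens")
```

Under the `uint16` assumption the three parts hold roughly 3.69 billion token IDs, which is close to the 3.689B training tokens reported in the composition section. That agreement is suggestive, not a confirmation of the storage format.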
## Data Sources & Composition
Our training mix consists of approximately **4.7 billion tokens** in total, built from the following sources:
### 1. Training Data (3.689B tokens)
* **Source:** A Wikipedia subset from Dolma 1.
* **Original HF Dataset:** [`allenai/OLMoE-mix-0924`](https://huggingface.co/datasets/allenai/OLMoE-mix-0924)
* **Command used to fetch raw data:**
```bash
wget -O data/wiki-001.json.gz "https://huggingface.co/datasets/allenai/OLMoE-mix-0924/resolve/main/data/wiki/wiki-0001.json.gz?download=true"
```
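Once the raw file is downloaded, it can be inspected before tokenization. The sketch below is a minimal reader, assuming the file is gzipped JSON Lines with one record per line and the document body under a `text` key (the usual Dolma layout, but not confirmed by this card). The function names `iter_documents` and `count_documents` are hypothetical.

```python
import gzip
import json


def iter_documents(path, text_field="text"):
    """Yield the text of each record in a gzipped JSON Lines file.

    ASSUMPTION: one JSON object per line, body stored under `text_field`.
    Blank lines and records with empty or missing text are skipped.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            text = record.get(text_field)
            if text:
                yield text


def count_documents(path):
    """Count non-empty documents by streaming the file once."""
    return sum(1 for _ in iter_documents(path))
```

For example, `next(iter_documents("data/wiki-001.json.gz"))` would return the first document's text if the layout matches the assumption. Streaming line by line keeps memory use flat regardless of file size.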
Provider:
iliasslasri



