
iliasslasri/tokenized-OLMoE-mix

Source: Hugging Face. Updated 2026-03-15; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/iliasslasri/tokenized-OLMoE-mix
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- moE
- olmoe
- pretraining
- allenai
pretty_name: Tokenized OLMoE Mix
size_categories:
- 1B<n<10B
---

# Dataset Card for Tokenized OLMoE Mix

## Dataset Summary

This dataset contains pre-tokenized training and evaluation data for training custom small-scale **OLMoE (Mixture-of-Experts)** models. The data is sourced primarily from the official AI2 Dolma 1 and C4 datasets and was curated for ablation studies and reproduction experiments related to the [OLMoE Technical Paper (arXiv:2409.02060)](https://arxiv.org/abs/2409.02060).

The data is provided in `.npy` format and was pre-tokenized with the `allenai/gpt-neox-olmo-dolma-v1_5` tokenizer.

## Dataset Structure

The dataset currently totals **7.38 GB** and is split into three parts for easier handling:

* `part-0-00000.npy` (2.51 GB)
* `part-0-00001.npy` (4.29 GB)
* `part-0-00002.npy` (574 MB)

## Data Sources & Composition

Our training mix consists of approximately **4.7 billion tokens** in total, built from the following sources:

### 1. Training Data (3.689B tokens)

* **Source:** A Wikipedia subset from Dolma 1.
* **Original HF Dataset:** [`allenai/OLMoE-mix-0924`](https://huggingface.co/datasets/allenai/OLMoE-mix-0924)
* **Command used to fetch raw data:**

```bash
wget -O data/wiki-001.json.gz "https://huggingface.co/datasets/allenai/OLMoE-mix-0924/resolve/main/data/wiki/wiki-0001.json.gz?download=true"
```
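As a rough illustration of how the `.npy` parts listed under Dataset Structure might be inspected, the sketch below memory-maps a token file instead of reading it fully into RAM. The card does not state the on-disk layout, so two things here are assumptions: that a file either carries a standard NumPy header (readable by `np.load`) or is a raw, headerless token array, and that in the raw case the token IDs are stored as `uint16`. The helper name `open_tokens` is hypothetical and not part of any dataset tooling.

```python
from pathlib import Path

import numpy as np

# Assumption: raw (headerless) files store token IDs as uint16.
# The dataset card does not confirm the dtype; adjust if needed.
ASSUMED_RAW_DTYPE = np.uint16


def open_tokens(path, raw_dtype=ASSUMED_RAW_DTYPE):
    """Memory-map a token file without loading it fully into memory.

    Tries the standard .npy format first; if the file has no NumPy
    header, falls back to treating it as a flat array of `raw_dtype`.
    """
    path = Path(path)
    try:
        # Standard .npy file: dtype and shape come from the header.
        return np.load(path, mmap_mode="r")
    except ValueError:
        # No recognizable header: interpret the bytes as a flat array.
        return np.memmap(path, dtype=raw_dtype, mode="r")
```

A call such as `tokens = open_tokens("part-0-00002.npy")` followed by `tokens[:16]` would then show the first few token IDs. If the values look implausible for the tokenizer's vocabulary, the assumed dtype is probably wrong.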
Provider:
iliasslasri