iliasslasri/tokenized-OLMoE-mix

Name: iliasslasri/tokenized-OLMoE-mix
Creator: iliasslasri
Published: 2026-03-15 09:23:18
License: No description available

Hugging Face, updated 2026-03-15, indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/iliasslasri/tokenized-OLMoE-mix


Resource description:

---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- moE
- olmoe
- pretraining
- allenai
pretty_name: Tokenized OLMoE Mix
size_categories:
- 1B<n<10B
---

# Dataset Card for Tokenized OLMoE Mix

## Dataset Summary

This dataset contains pre-tokenized training and evaluation data designed for training custom small-scale **OLMoE (Mixture-of-Experts)** models. The data is sourced primarily from the official AI2 Dolma 1 and C4 datasets and was curated to run ablation studies and reproduction experiments related to the [OLMoE Technical Paper (arXiv:2409.02060)](https://arxiv.org/abs/2409.02060).

It is provided in `.npy` format, having been pre-tokenized using the `allenai/gpt-neox-olmo-dolma-v1_5` tokenizer.

## Dataset Structure

The dataset currently totals **7.38 GB** and is split into three parts for easier handling:

* `part-0-00000.npy` (2.51 GB)
* `part-0-00001.npy` (4.29 GB)
* `part-0-00002.npy` (574 MB)

## Data Sources & Composition

Our training mix consists of approximately **4.7 Billion tokens** in total, built from the following sources:

### 1. Training Data (3.689B tokens)

* **Source:** A Wikipedia subset from Dolma 1.
* **Original HF Dataset:** [`allenai/OLMoE-mix-0924`](https://huggingface.co/datasets/allenai/OLMoE-mix-0924)
* **Command used to fetch raw data:**

  ```bash
  wget -O data/wiki-001.json.gz "https://huggingface.co/datasets/allenai/OLMoE-mix-0924/resolve/main/data/wiki/wiki-0001.json.gz?download=true"
  ```
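The `.npy` parts listed under Dataset Structure could be read along the following lines. This is a minimal sketch, not the author's loading code, and it rests on an assumption the card does not confirm: that each part is a flat, header-less array of `uint16` token IDs, as in the usual OLMo tooling convention. The assumption is at least consistent with the sizes given, since 7.38 GB at 2 bytes per token is roughly 3.69 billion tokens. If a file turns out to carry a standard NumPy header, `np.load(path, mmap_mode="r")` would be the appropriate call instead. The helper names `open_token_file` and `iter_sequences` are hypothetical, and the demo runs on a small synthetic file rather than on the real parts.

```python
import os
import tempfile

import numpy as np


def open_token_file(path, dtype=np.uint16):
    """Memory-map a flat token file so it is not loaded into RAM at once.

    ASSUMPTION: the file holds raw token IDs with no NumPy header and
    the IDs fit in 16 bits. Neither is stated in the dataset card.
    """
    return np.memmap(path, dtype=dtype, mode="r")


def iter_sequences(tokens, seq_len):
    """Yield consecutive, non-overlapping windows of seq_len tokens.

    A trailing remainder shorter than seq_len is dropped.
    """
    n_full = len(tokens) // seq_len
    for i in range(n_full):
        # np.asarray copies only this window out of the memory map.
        yield np.asarray(tokens[i * seq_len:(i + 1) * seq_len])


# Demo on a synthetic 10-token file written in the assumed layout.
demo_path = os.path.join(tempfile.mkdtemp(), "demo-tokens.npy")
np.arange(10, dtype=np.uint16).tofile(demo_path)

demo_tokens = open_token_file(demo_path)
demo_chunks = [chunk.tolist() for chunk in iter_sequences(demo_tokens, seq_len=4)]
print(len(demo_tokens), demo_chunks)
```

For a real part, the same call would be `open_token_file("part-0-00000.npy")`. Checking that the maximum ID in a sample is below the tokenizer's vocabulary size is a cheap way to test the `uint16` assumption before training on the data.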

Provider:

iliasslasri
