openeurollm/dolci-instruct-sft-tokenized

Name: openeurollm/dolci-instruct-sft-tokenized
Creator: openeurollm
Published: 2026-02-26 14:54:39
License: Apache 2.0

Source: Hugging Face. Updated: 2026-02-26. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/openeurollm/dolci-instruct-sft-tokenized

Description:

---
license: apache-2.0
language:
- en
tags:
- olmo
- sft
- tokenized
- olmo-core
size_categories:
- 1M<n<10M
---

# Dolci-Instruct-SFT Tokenized

Pre-tokenized version of the [allenai/Dolci-Instruct-SFT](https://huggingface.co/datasets/allenai/Dolci-Instruct-SFT) dataset, ready for training with [OLMo-core](https://github.com/allenai/OLMo-core).

See also: [openeurollm/dolci-think-sft-tokenized](https://huggingface.co/datasets/openeurollm/dolci-think-sft-tokenized) for the thinking variant.

## Dataset Details

| Property | Value |
|----------|-------|
| Source dataset | [allenai/Dolci-Instruct-SFT](https://huggingface.co/datasets/allenai/Dolci-Instruct-SFT) |
| Tokenizer | [allenai/Olmo-3-7B-Instruct-SFT](https://huggingface.co/allenai/Olmo-3-7B-Instruct-SFT) |
| Max sequence length | 32,768 |
| Total instances | 2,152,111 |
| Total tokens | 1.7B |
| Trainable tokens | 789M (46.2%) |
| Avg tokens per instance | 793 |

During SFT, only assistant response tokens are trainable. System and user message tokens are masked out via `labels_mask`, so the model sees them as context but is not trained to predict them. The trainable ratio (46.2%) is lower than in the thinking variant because assistant responses in this dataset are shorter, so system and user prompt tokens make up a larger share of each sequence.

## File Format

The dataset is stored as pre-merged NumPy arrays compatible with OLMo-core's data loading:

- `token_ids_part_XXXX.npy`: token ID arrays
- `labels_mask_part_XXXX.npy`: label mask arrays, where `1` = trainable (assistant response) and `0` = masked (system/user message)
- `tokenizer/`: tokenizer files used during tokenization
- `dataset_statistics.json`: detailed statistics about the tokenized dataset

## Usage with OLMo-core

Point your OLMo-core training config to this dataset directory. The format is directly compatible with the OLMo-core SFT data loader.

## License

Apache 2.0
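
## Example: reading `labels_mask` (sketch)

The masking convention described under File Format (`1` = trainable, `0` = masked) can be illustrated with a small, self-contained sketch. It uses a toy sequence in plain Python lists rather than real shards, so it runs without downloading anything. The helper names `trainable_fraction` and `build_labels` are hypothetical, and the `-100` ignore value is an assumption borrowed from the default `ignore_index` of PyTorch's cross-entropy loss; this is not necessarily how OLMo-core applies the mask internally.

```python
from typing import List, Sequence

# Assumption: -100 mirrors PyTorch's default CrossEntropyLoss ignore_index.
# The dataset card does not say which value OLMo-core uses internally.
IGNORE_INDEX = -100


def trainable_fraction(labels_mask: Sequence[int]) -> float:
    """Return the share of positions marked trainable (mask value 1)."""
    total = len(labels_mask)
    if total == 0:
        return 0.0
    return sum(1 for m in labels_mask if m == 1) / total


def build_labels(
    token_ids: Sequence[int],
    labels_mask: Sequence[int],
    ignore_index: int = IGNORE_INDEX,
) -> List[int]:
    """Keep token IDs where the mask is 1; replace masked positions."""
    if len(token_ids) != len(labels_mask):
        raise ValueError("token_ids and labels_mask must have the same length")
    return [
        int(tok) if m == 1 else ignore_index
        for tok, m in zip(token_ids, labels_mask)
    ]


# Toy sequence: four prompt tokens (masked), six assistant tokens (trainable).
token_ids = [11, 12, 13, 14, 21, 22, 23, 24, 25, 26]
labels_mask = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

fraction = trainable_fraction(labels_mask)
labels = build_labels(token_ids, labels_mask)

print(f"trainable fraction: {fraction:.1%}")
print(labels)
```

With real shards, a matching `token_ids_part_XXXX.npy` / `labels_mask_part_XXXX.npy` pair could presumably be opened with `numpy.load(path, mmap_mode="r")` and passed to the same helpers. The array shape and dtype are not documented in this card, so check `dataset_statistics.json` and the arrays themselves before relying on that.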

Provider:

openeurollm
