openeurollm/dolci-instruct-sft-tokenized
Source: Hugging Face · Updated 2026-02-26 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/openeurollm/dolci-instruct-sft-tokenized
Resource description:
---
license: apache-2.0
language:
- en
tags:
- olmo
- sft
- tokenized
- olmo-core
size_categories:
- 1M<n<10M
---
# Dolci-Instruct-SFT Tokenized
Pre-tokenized version of the [allenai/Dolci-Instruct-SFT](https://huggingface.co/datasets/allenai/Dolci-Instruct-SFT) dataset, ready for training with [OLMo-core](https://github.com/allenai/OLMo-core).
See also: [openeurollm/dolci-think-sft-tokenized](https://huggingface.co/datasets/openeurollm/dolci-think-sft-tokenized) for the thinking variant.
## Dataset Details
| Property | Value |
|----------|-------|
| Source dataset | [allenai/Dolci-Instruct-SFT](https://huggingface.co/datasets/allenai/Dolci-Instruct-SFT) |
| Tokenizer | [allenai/Olmo-3-7B-Instruct-SFT](https://huggingface.co/allenai/Olmo-3-7B-Instruct-SFT) |
| Max sequence length | 32,768 |
| Total instances | 2,152,111 |
| Total tokens | 1.7B |
| Trainable tokens | 789M (46.2%) |
| Avg tokens per instance | 793 |
During SFT, only assistant response tokens are trainable. System and user message tokens are masked out via `labels_mask`, so the model sees them as context but is not trained to predict them. The trainable ratio (46.2%) is lower than in the thinking variant because assistant responses in this dataset are shorter, so system and user prompt tokens make up a larger share of each sequence.
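The masking described above can be illustrated with a minimal sketch on toy arrays. It is not OLMo-core's actual loss code. The token IDs are made up, the helper names `build_labels` and `trainable_ratio` are hypothetical, and `IGNORE_INDEX = -100` is an assumption borrowed from the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import numpy as np

# Assumption: -100 mirrors the default ignore_index of PyTorch's cross-entropy loss.
IGNORE_INDEX = -100


def build_labels(token_ids: np.ndarray, labels_mask: np.ndarray) -> np.ndarray:
    """Return labels where masked (non-trainable) positions are set to IGNORE_INDEX."""
    if token_ids.shape != labels_mask.shape:
        raise ValueError("token_ids and labels_mask must have the same shape")
    # Cast to a signed type first so the negative ignore value is representable.
    signed_ids = token_ids.astype(np.int64)
    return np.where(labels_mask.astype(bool), signed_ids, IGNORE_INDEX)


def trainable_ratio(labels_mask: np.ndarray) -> float:
    """Fraction of positions with mask value 1 (assistant response tokens)."""
    return float(labels_mask.astype(bool).mean())


# Toy sequence: 3 prompt tokens followed by 2 assistant tokens (IDs are made up).
token_ids = np.array([11, 12, 13, 21, 22], dtype=np.uint32)
labels_mask = np.array([0, 0, 0, 1, 1], dtype=np.uint8)

labels = build_labels(token_ids, labels_mask)
ratio = trainable_ratio(labels_mask)
print(labels.tolist(), ratio)
```

Applied to the whole dataset, the same ratio corresponds to the 46.2% reported in the table: 789M trainable tokens out of roughly 1.7B.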
## File Format
The dataset is stored as pre-merged NumPy arrays compatible with OLMo-core's data loading:
- `token_ids_part_XXXX.npy`: token ID arrays
- `labels_mask_part_XXXX.npy`: label mask arrays, where `1` = trainable (assistant response) and `0` = masked (system/user message)
- `tokenizer/`: tokenizer files used during tokenization
- `dataset_statistics.json`: detailed statistics about the tokenized dataset
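Because each `token_ids` file has a matching `labels_mask` file, it can help to check that every part is complete before training. The sketch below uses only the Python standard library and pairs files by their part suffix. The helper name `pair_parts` is hypothetical, and the code deliberately does not read the arrays or assume an on-disk dtype, since neither is stated here.

```python
import re
from pathlib import Path

# Matches the two array file families listed above; the part suffix
# (shown as XXXX in the file names) is captured as-is.
PART_RE = re.compile(r"^(token_ids|labels_mask)_part_(.+)\.npy$")


def pair_parts(dataset_dir):
    """Return a sorted list of (token_ids_path, labels_mask_path), one per part."""
    found = {}
    for path in Path(dataset_dir).iterdir():
        match = PART_RE.match(path.name)
        if match:
            kind, part = match.groups()
            found.setdefault(part, {})[kind] = path

    pairs = []
    for part in sorted(found):
        files = found[part]
        # Every part needs both a token file and a mask file.
        if set(files) != {"token_ids", "labels_mask"}:
            raise FileNotFoundError(f"part {part} is missing a file")
        pairs.append((files["token_ids"], files["labels_mask"]))
    return pairs
```

Calling `pair_parts("path/to/local/copy")` on a downloaded copy of the dataset (the path is a placeholder) returns the pairs in part order and raises if a mask or token file is missing.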
## Usage with OLMo-core
Point your OLMo-core training config to this dataset directory. The format is directly compatible with the OLMo-core SFT data loader.
## License
Apache 2.0
Provider:
openeurollm



