
openeurollm/dolci-instruct-sft-tokenized

Hugging Face · Updated 2026-02-26 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/openeurollm/dolci-instruct-sft-tokenized
Description:
---
license: apache-2.0
language:
- en
tags:
- olmo
- sft
- tokenized
- olmo-core
size_categories:
- 1M<n<10M
---

# Dolci-Instruct-SFT Tokenized

Pre-tokenized version of the [allenai/Dolci-Instruct-SFT](https://huggingface.co/datasets/allenai/Dolci-Instruct-SFT) dataset, ready for training with [OLMo-core](https://github.com/allenai/OLMo-core).

See also: [openeurollm/dolci-think-sft-tokenized](https://huggingface.co/datasets/openeurollm/dolci-think-sft-tokenized) for the thinking variant.

## Dataset Details

| Property | Value |
|----------|-------|
| Source dataset | [allenai/Dolci-Instruct-SFT](https://huggingface.co/datasets/allenai/Dolci-Instruct-SFT) |
| Tokenizer | [allenai/Olmo-3-7B-Instruct-SFT](https://huggingface.co/allenai/Olmo-3-7B-Instruct-SFT) |
| Max sequence length | 32,768 |
| Total instances | 2,152,111 |
| Total tokens | 1.7B |
| Trainable tokens | 789M (46.2%) |
| Avg tokens per instance | 793 |

During SFT, only assistant response tokens are trainable. System and user message tokens are masked out via `labels_mask`, so the model sees them as context but is not trained to predict them. The trainable ratio (46.2%) is lower than in the thinking variant because the assistant responses in this dataset are shorter, so system/user prompt tokens make up a larger share of each sequence.

## File Format

The dataset is stored as pre-merged NumPy arrays compatible with OLMo-core's data loading:

- `token_ids_part_XXXX.npy`: token ID arrays
- `labels_mask_part_XXXX.npy`: label mask arrays, where `1` = trainable (assistant response) and `0` = masked (system/user message)
- `tokenizer/`: tokenizer files used during tokenization
- `dataset_statistics.json`: detailed statistics about the tokenized dataset

## Usage with OLMo-core

Point your OLMo-core training config at this dataset directory. The format is directly compatible with the OLMo-core SFT data loader.

## License

Apache 2.0
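
## Example: inspecting a token/mask pair

The following is a minimal sketch, not part of the original card, of how a `token_ids` / `labels_mask` pair could be inspected outside OLMo-core. Several things here are assumptions rather than facts from the card: that the files are flat raw buffers readable with `np.memmap`, the `uint32` and `bool` dtypes, the part id `"0000"`, and the `-100` ignore index (a common PyTorch convention, not necessarily what OLMo-core uses). The helper names are hypothetical. If the files turn out to be header-bearing `.npy` files, `np.load(path, mmap_mode="r")` would be the call to use instead. The demo writes a tiny synthetic pair so that it runs without downloading anything.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_part(data_dir, part, token_dtype=np.uint32, mask_dtype=np.bool_):
    """Memory-map one token/mask pair named after the card's file pattern.

    Assumption: both files are flat raw buffers of the given dtypes.
    Verify the real on-disk layout before relying on these defaults.
    """
    data_dir = Path(data_dir)
    token_ids = np.memmap(
        data_dir / f"token_ids_part_{part}.npy", dtype=token_dtype, mode="r"
    )
    labels_mask = np.memmap(
        data_dir / f"labels_mask_part_{part}.npy", dtype=mask_dtype, mode="r"
    )
    if token_ids.shape != labels_mask.shape:
        raise ValueError("token_ids and labels_mask must have the same length")
    return token_ids, labels_mask


def trainable_fraction(labels_mask):
    """Share of positions marked 1 (trainable assistant tokens)."""
    return np.count_nonzero(labels_mask) / labels_mask.size


def masked_labels(token_ids, labels_mask, ignore_index=-100):
    """Keep token ids where the mask is 1; put ignore_index everywhere else."""
    keep = np.asarray(labels_mask, dtype=bool)
    return np.where(keep, np.asarray(token_ids, dtype=np.int64), ignore_index)


def demo():
    # Synthetic stand-in: three prompt tokens, four assistant tokens, one masked tail token.
    tokens = np.array([11, 12, 13, 21, 22, 23, 24, 25], dtype=np.uint32)
    mask = np.array([0, 0, 0, 1, 1, 1, 1, 0], dtype=np.bool_)
    with tempfile.TemporaryDirectory() as tmp:
        tokens.tofile(Path(tmp) / "token_ids_part_0000.npy")
        mask.tofile(Path(tmp) / "labels_mask_part_0000.npy")
        token_ids, labels_mask = load_part(tmp, "0000")
        frac = trainable_fraction(labels_mask)
        labels = masked_labels(token_ids, labels_mask)
        # Release the memory maps before the temporary directory is removed.
        del token_ids, labels_mask
    return frac, labels


if __name__ == "__main__":
    frac, labels = demo()
    print(f"trainable fraction: {frac:.3f}")
    print(labels)
```

Run against the real files, `trainable_fraction` summed over all parts should land near the 46.2% reported in the table, which is a quick sanity check that the dtypes were guessed correctly.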
Provider:
openeurollm