
openeurollm/dolci-think-sft-tokenized

Source: Hugging Face. Updated 2026-02-26. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/openeurollm/dolci-think-sft-tokenized
Resource description:
---
license: apache-2.0
language:
- en
tags:
- olmo
- sft
- tokenized
- olmo-core
size_categories:
- 1M<n<10M
---

# Dolci-Think-SFT Tokenized

Pre-tokenized version of the [allenai/Dolci-Think-SFT-7B](https://huggingface.co/datasets/allenai/Dolci-Think-SFT-7B) dataset, ready for training with [OLMo-core](https://github.com/allenai/OLMo-core).

This dataset was used to train the [openeurollm/OLMo-3-7B-Think-SFT](https://huggingface.co/openeurollm/OLMo-3-7B-Think-SFT) checkpoints.

See also: [openeurollm/dolci-instruct-sft-tokenized](https://huggingface.co/datasets/openeurollm/dolci-instruct-sft-tokenized) for the instruct (non-thinking) variant.

## Dataset Details

| Property | Value |
|----------|-------|
| Source dataset | [allenai/Dolci-Think-SFT-7B](https://huggingface.co/datasets/allenai/Dolci-Think-SFT-7B) |
| Tokenizer | [allenai/Olmo-3-7B-Think-SFT](https://huggingface.co/allenai/Olmo-3-7B-Think-SFT) |
| Max sequence length | 32,768 |
| Total instances | 2,268,177 |
| Total tokens | 22.7B |
| Trainable tokens | 22.2B (97.5%) |
| Avg tokens per instance | 10,018 |

During SFT, only assistant response tokens are trainable. System and user message tokens are masked out via `labels_mask` so the model sees them as context but is not trained to predict them. The high trainable ratio (97.5%) reflects the long chain-of-thought responses in this dataset, where the assistant reasoning dominates each sequence.

## File Format

The dataset is stored as pre-merged NumPy arrays compatible with OLMo-core's data loading:

- `token_ids_part_XXXX.npy`: token ID arrays (84 parts)
- `labels_mask_part_XXXX.npy`: label mask arrays (84 parts), where `1` = trainable (assistant response) and `0` = masked (system/user message)
- `tokenizer/`: tokenizer files used during tokenization
- `dataset_statistics.json`: detailed statistics about the tokenized dataset

## Usage with OLMo-core

Point your OLMo-core training config to this dataset directory. The format is directly compatible with the OLMo-core SFT data loader.

## License

Apache 2.0
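
## Example: Reading a Label Mask

To illustrate the `labels_mask` convention described under File Format (`1` = trainable, `0` = masked), here is a minimal sketch that computes the trainable ratio of a mask and derives loss labels from a token array. It runs on small synthetic arrays, not on the real part files. The helper names `trainable_fraction` and `to_loss_labels` are hypothetical, the `ignore_index=-100` default is the common PyTorch convention rather than something this dataset card states, and the usual one-position shift for next-token prediction is left out. This is not OLMo-core's own loader.

```python
import numpy as np


def trainable_fraction(labels_mask):
    """Return the share of positions marked trainable (mask value 1)."""
    mask = np.asarray(labels_mask)
    if mask.size == 0:
        raise ValueError("labels_mask is empty")
    return float(np.count_nonzero(mask)) / mask.size


def to_loss_labels(token_ids, labels_mask, ignore_index=-100):
    """Copy token IDs, replacing masked positions with ignore_index.

    ignore_index=-100 is an assumption borrowed from PyTorch's
    cross-entropy default; the dataset card does not specify it.
    """
    token_ids = np.asarray(token_ids)
    labels_mask = np.asarray(labels_mask)
    if token_ids.shape != labels_mask.shape:
        raise ValueError("token_ids and labels_mask must have the same shape")
    # Cast to a signed type so the negative ignore_index is representable.
    return np.where(labels_mask.astype(bool), token_ids.astype(np.int64), ignore_index)


# Synthetic stand-in for one (token_ids, labels_mask) pair:
# three prompt tokens (masked) followed by five response tokens (trainable).
token_ids = np.array([11, 12, 13, 21, 22, 23, 24, 25], dtype=np.uint32)
labels_mask = np.array([0, 0, 0, 1, 1, 1, 1, 1], dtype=np.uint8)

print(trainable_fraction(labels_mask))                  # 0.625
print(to_loss_labels(token_ids, labels_mask).tolist())  # [-100, -100, -100, 21, 22, 23, 24, 25]
```

To apply this to a real part, the arrays would first have to be read from disk. The card does not say which dtype the files use or whether they carry a standard `.npy` header, so check `dataset_statistics.json` and the OLMo-core data loader before choosing between `np.load` and `np.memmap` with an explicit dtype.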
Provided by:
openeurollm