openeurollm/dolci-think-sft-tokenized

Name: openeurollm/dolci-think-sft-tokenized
Creator: openeurollm
Published: 2026-02-26 14:54:39
License: Apache 2.0

Source: Hugging Face. Updated 2026-02-26. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/openeurollm/dolci-think-sft-tokenized


Description:

---
license: apache-2.0
language:
- en
tags:
- olmo
- sft
- tokenized
- olmo-core
size_categories:
- 1M<n<10M
---

# Dolci-Think-SFT Tokenized

Pre-tokenized version of the [allenai/Dolci-Think-SFT-7B](https://huggingface.co/datasets/allenai/Dolci-Think-SFT-7B) dataset, ready for training with [OLMo-core](https://github.com/allenai/OLMo-core).

This dataset was used to train the [openeurollm/OLMo-3-7B-Think-SFT](https://huggingface.co/openeurollm/OLMo-3-7B-Think-SFT) checkpoints.

See also: [openeurollm/dolci-instruct-sft-tokenized](https://huggingface.co/datasets/openeurollm/dolci-instruct-sft-tokenized) for the instruct (non-thinking) variant.

## Dataset Details

| Property | Value |
|----------|-------|
| Source dataset | [allenai/Dolci-Think-SFT-7B](https://huggingface.co/datasets/allenai/Dolci-Think-SFT-7B) |
| Tokenizer | [allenai/Olmo-3-7B-Think-SFT](https://huggingface.co/allenai/Olmo-3-7B-Think-SFT) |
| Max sequence length | 32,768 |
| Total instances | 2,268,177 |
| Total tokens | 22.7B |
| Trainable tokens | 22.2B (97.5%) |
| Avg tokens per instance | 10,018 |

During SFT, only assistant response tokens are trainable. System and user message tokens are masked out via `labels_mask` so the model sees them as context but is not trained to predict them. The high trainable ratio (97.5%) reflects the long chain-of-thought responses in this dataset, where the assistant reasoning dominates each sequence.

## File Format

The dataset is stored as pre-merged NumPy arrays compatible with OLMo-core's data loading:

- `token_ids_part_XXXX.npy`: token ID arrays (84 parts)
- `labels_mask_part_XXXX.npy`: label mask arrays (84 parts), where `1` = trainable (assistant response) and `0` = masked (system/user message)
- `tokenizer/`: tokenizer files used during tokenization
- `dataset_statistics.json`: detailed statistics about the tokenized dataset

## Usage with OLMo-core

Point your OLMo-core training config to this dataset directory. The format is directly compatible with the OLMo-core SFT data loader.

## License

Apache 2.0
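
To make the `labels_mask` convention from the File Format section concrete, here is a minimal sketch of how a token/mask pair could be turned into training labels outside of OLMo-core. It is not OLMo-core's own loader. Several things are assumptions rather than facts from this card: the helper names (`load_part`, `build_labels`, `trainable_fraction`), the ignore value `-100` (a common PyTorch convention), the array dtypes, and the idea that each file carries a standard NumPy `.npy` header. If the files turn out to be raw buffers, `np.memmap` with an explicit dtype would be needed instead of `np.load`. The example runs on a small synthetic array, so no download is required.

```python
import numpy as np

# Assumption: -100 is the usual "ignore" label for PyTorch cross-entropy;
# the dataset card does not specify which value a trainer should use.
IGNORE_INDEX = -100


def load_part(token_ids_path, labels_mask_path):
    """Memory-map one token/mask pair (hypothetical helper).

    Assumes both files have a standard .npy header. The caller supplies
    the paths; the part-number format is not spelled out in the card.
    """
    token_ids = np.load(token_ids_path, mmap_mode="r")
    labels_mask = np.load(labels_mask_path, mmap_mode="r")
    if token_ids.shape != labels_mask.shape:
        raise ValueError("token_ids and labels_mask must have the same shape")
    return token_ids, labels_mask


def build_labels(token_ids, labels_mask, ignore_index=IGNORE_INDEX):
    """Keep token IDs where mask == 1; replace masked positions with ignore_index."""
    token_ids = np.asarray(token_ids, dtype=np.int64)
    return np.where(np.asarray(labels_mask) == 1, token_ids, ignore_index)


def trainable_fraction(labels_mask):
    """Share of positions marked trainable (mask == 1)."""
    return float(np.asarray(labels_mask, dtype=bool).mean())


# Synthetic stand-in for one short sequence: three masked prompt tokens
# followed by five trainable assistant tokens.
token_ids = np.array([11, 12, 13, 21, 22, 23, 24, 25], dtype=np.uint32)
labels_mask = np.array([0, 0, 0, 1, 1, 1, 1, 1], dtype=np.uint8)

labels = build_labels(token_ids, labels_mask)
fraction = trainable_fraction(labels_mask)
print(labels.tolist())
print(fraction)
```

Running `trainable_fraction` over a real mask part should give a value close to the 97.5% trainable ratio reported in the Dataset Details table, although individual parts may differ from the dataset-wide figure.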

Provider:

openeurollm
