Step-3.5-Flash-SFT
Source: ModelScope community (updated 2026-05-16, indexed 2026-03-29)
Download link: https://modelscope.cn/datasets/stepfun-ai/Step-3.5-Flash-SFT
# Step-3.5-Flash-SFT
`Step-3.5-Flash-SFT` is a general-domain supervised fine-tuning release for chat models.
This repository keeps the full training interface in one place:
- `json/`: canonical raw training data
- `tokenizers/`: tokenizer snapshots for Step-3.5-Flash and Qwen3, released to preserve chat-template alignment
- `compiled/`: tokenizer-specific compiled shards for StepTronOSS training
## Data Format
Each raw shard is a JSON file whose top level is a list of examples. Each example currently contains a `conversations` field with ordered message turns.
```json
{
  "conversations": [
    {
      "role": "user",
      "content": "...",
      "loss_mask": 1,
      "name": "",
      "meta": {}
    },
    {
      "role": "assistant",
      "content": "...",
      "loss_mask": 1,
      "name": "",
      "meta": {},
      "reasoning_content": "..."
    }
  ]
}
```
Observed fields:
- `role`: speaker role such as `user` or `assistant`
- `content`: visible message text
- `loss_mask`: turn-level supervision flag
- `name`: optional speaker name
- `meta`: per-turn metadata
- `reasoning_content`: optional assistant-side field present in some examples
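As a minimal sketch of how the fields above might be checked before training, the snippet below validates one example against the observed field list. The helper name `validate_example` is hypothetical, and treating the five fields shown on both turns of the example as required is an assumption for illustration, not a rule stated by this release.

```python
from typing import Any

# Fields shown on every turn in the example above (assumed required here).
REQUIRED_TURN_FIELDS = {"role", "content", "loss_mask", "name", "meta"}
# Field that only some assistant turns carry.
OPTIONAL_TURN_FIELDS = {"reasoning_content"}


def validate_example(example: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    turns = example.get("conversations")
    if not isinstance(turns, list):
        return ["`conversations` is missing or is not a list"]
    problems: list[str] = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        keys = set(turn)
        missing = REQUIRED_TURN_FIELDS - keys
        if missing:
            problems.append(f"turn {i}: missing fields {sorted(missing)}")
        unknown = keys - REQUIRED_TURN_FIELDS - OPTIONAL_TURN_FIELDS
        if unknown:
            problems.append(f"turn {i}: unexpected fields {sorted(unknown)}")
    return problems
```

Returning a list of messages rather than raising makes it easy to scan a whole shard and report every problem at once.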
## Recommended Training Framework
The official recommended training framework for this release is [StepTronOSS](https://github.com/stepfun-ai/SteptronOss).
Reproduce our experiment: [step3p5_flash_sft_step3_data_muon.py](https://github.com/stepfun-ai/SteptronOss/blob/dev/playground/sft/step3/step3p5_flash_sft_step3_data_muon.py)
Important compatibility rules:
- Train on this dataset with a sequential sampler (`Sequential sampler`). Do not shuffle when reproducing the StepTronOSS recipe.
- Do not mix tokenizer variants and compiled variants: use each compiled variant only with its matching tokenizer snapshot.
- Use `transformers<5.0` for `apply_chat_template(...)`.
- Raw JSON is the source of truth. Compiled shards are tokenizer-specific acceleration artifacts for StepTronOSS.
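The `transformers<5.0` rule can be checked at startup so that a mismatched environment fails early. The following is a minimal sketch: the helper names are hypothetical, and comparing only the major version number is a simplifying assumption (it also rejects 5.0 pre-releases).

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Extract the leading major number from a version string such as '4.57.1'."""
    return int(version.split(".", 1)[0])


def is_supported_transformers(version: str) -> bool:
    """True when the version satisfies the documented `transformers<5.0` rule."""
    return major_version(version) < 5


def check_installed_transformers() -> None:
    """Raise early if the installed transformers package does not satisfy the rule."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError as exc:
        raise RuntimeError("transformers is not installed") from exc
    if not is_supported_transformers(installed):
        raise RuntimeError(
            f"transformers {installed} found; this release expects transformers<5.0"
        )
```

Calling `check_installed_transformers()` at the top of a training or compile script would surface the problem before any tokenization runs.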
## StepTronOSS Training Semantics
In the StepTronOSS reference SFT path, dialogs are tokenized with the selected Hugging Face chat template and then packed sequentially. Training loss is applied only to assistant tokens after the last user turn when that turn has `loss_mask = 1`.
This means tokenizer snapshots are part of the effective training definition, not optional metadata.
## Using With StepTronOSS
Reference recipes:
- Step-3.5-Flash uses `Recipe0311CompiledSFTDataConfig`
- Qwen3 uses `Recipe0311QwenCompiledSFTDataConfig`
Representative StepTronOSS files:
- `playground/sft/step3/step3_flash_sft_step3_data_muon.py`
- `playground/sft/qwen3/qwen3_30a3b_sft_step3_data.py`
Minimal local pattern for Step-3.5-Flash compiled data:
```python
from pathlib import Path

from playground.data.sft.oss260312.step_sft_data_config0311 import Recipe0311SFTDataConfig
from playground.tools.compile_recipe import CompiledDataRecipe, CompiledDatasetsConfig
from playground.pretrain.step3p5.step3p5_flash import Step3p5FlashModelConfig
from playground.sft.qwen3.qwen3_sft_base import Exp as BaseExp

# Local checkout of this dataset repository.
DATA_ROOT = Path("/path/to/Step-3.5-Flash-SFT")


class PublicStep3p5CompiledDatasetsConfig(CompiledDatasetsConfig):
    # Point the single "general" domain at the Step-3.5-Flash compiled shards.
    compiled_recipe = CompiledDataRecipe(
        domains={"general": str(DATA_ROOT / "compiled/step3p5_flash/general")},
        epochs={"general": 1},
    )


class PublicStep3p5CompiledSFTDataConfig(Recipe0311SFTDataConfig):
    dataset_cfg = PublicStep3p5CompiledDatasetsConfig


class Exp(BaseExp):
    model_cfg = Step3p5FlashModelConfig
    data_cfg = PublicStep3p5CompiledSFTDataConfig
```
Run with the matching tokenizer:
```bash
torchrun --standalone --nproc-per-node=8 your_exp.py \
    tokenizer_cfg.tokenizer_path=/path/to/Step-3.5-Flash-SFT/tokenizers/step3p5_flash
```
For Qwen3, point the compiled data root to `compiled/qwen3/general` and the tokenizer path to `tokenizers/qwen3`.
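Because the compiled data root and the tokenizer path must always come from the same variant, it can help to derive both from a single variant name. The sketch below assumes only the directory layout shown in this README; the helper `variant_paths` and the `VARIANTS` tuple are hypothetical and not part of StepTronOSS.

```python
from pathlib import Path

# Variant directory names taken from the paths shown in this README.
VARIANTS = ("step3p5_flash", "qwen3")


def variant_paths(data_root: Path, variant: str) -> dict[str, Path]:
    """Return the compiled-data and tokenizer directories for one variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return {
        "compiled": data_root / "compiled" / variant / "general",
        "tokenizer": data_root / "tokenizers" / variant,
    }
```

For example, `variant_paths(Path("/path/to/Step-3.5-Flash-SFT"), "qwen3")` yields the `compiled/qwen3/general` and `tokenizers/qwen3` directories together, which avoids pairing a compiled variant with the wrong tokenizer by accident.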
## How To Compile
StepTronOSS provides tokenizer-specific compile entrypoints:
- `playground/data/sft/oss260312/step_sft_data_config0311_step3p5_tokenizer.py`
- `playground/data/sft/oss260312/step_sft_data_config0311_qwen_tokenizer.py`
Compile Step-3.5-Flash:
```bash
cd /path/to/SteptronOss
python3 playground/data/sft/oss260312/step_sft_data_config0311_step3p5_tokenizer.py \
    --tokenizer-path /path/to/Step-3.5-Flash-SFT/tokenizers/step3p5_flash
```
Compile Qwen3:
```bash
cd /path/to/SteptronOss
python3 playground/data/sft/oss260312/step_sft_data_config0311_qwen_tokenizer.py \
    --tokenizer-path /path/to/Step-3.5-Flash-SFT/tokenizers/qwen3
```
Notes:
- Use the tokenizer snapshot that matches the compiled variant you want to generate.
- Keep `transformers<5.0` in the compile environment.
- The reference scripts write to tokenizer-specific `COMPILED_ROOT_...` paths defined in the scripts.
## Raw JSON Loading
Raw JSON shards can also be inspected with `datasets`:
```python
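from datasets import load_dataset

# Run from the dataset root so that the relative glob resolves.
dataset = load_dataset("json", data_files={"train": "json/general/chunk_*.json"})

# Print the ordered message turns of the first example.
print(dataset["train"][0]["conversations"])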
```
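Since each raw shard is a JSON file whose top level is a list of examples, a shard can also be read with the standard library alone. The sketch below is illustrative: the helper names `load_shard` and `count_roles` are hypothetical, and UTF-8 encoding is an assumption.

```python
import json
from collections import Counter
from pathlib import Path


def load_shard(path: Path) -> list[dict]:
    """Load one raw shard; the top level is expected to be a list of examples."""
    with path.open(encoding="utf-8") as f:
        examples = json.load(f)
    if not isinstance(examples, list):
        raise ValueError(f"{path}: expected a top-level list")
    return examples


def count_roles(examples: list[dict]) -> Counter:
    """Count message turns by `role` across the given examples."""
    return Counter(turn["role"] for ex in examples for turn in ex["conversations"])
```

A typical use would be to loop over `sorted(Path("json/general").glob("chunk_*.json"))`, call `load_shard` on each path, and sum the resulting counters. Note that `json.load` reads a whole shard into memory at once.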
## Notes
- This is a training corpus, not a benchmark.
- Some assistant turns include `reasoning_content` in addition to final `content`; downstream users may keep, remove, or transform that field depending on their training recipe.
- Compiled datasets are StepTronOSS-specific derived artifacts and are not intended as a framework-agnostic exchange format.
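The options for `reasoning_content` mentioned above (keep, remove, or transform) can be sketched on raw examples as follows. Both helper names are hypothetical, and the `<think>` tag format in `inline_reasoning` is purely an assumption for illustration; the chat templates shipped in `tokenizers/` may already handle this field in their own way.

```python
import copy


def strip_reasoning(example: dict) -> dict:
    """Return a copy of the example with `reasoning_content` removed from every turn."""
    cleaned = copy.deepcopy(example)
    for turn in cleaned["conversations"]:
        turn.pop("reasoning_content", None)
    return cleaned


def inline_reasoning(
    example: dict, open_tag: str = "<think>", close_tag: str = "</think>"
) -> dict:
    """Return a copy where reasoning is folded into `content` behind assumed tags."""
    merged = copy.deepcopy(example)
    for turn in merged["conversations"]:
        reasoning = turn.pop("reasoning_content", None)
        if reasoning:
            turn["content"] = f"{open_tag}{reasoning}{close_tag}{turn['content']}"
    return merged
```

Both helpers work on deep copies, so the loaded raw data stays unchanged and different recipes can be derived from the same shards.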
## Responsible Data Disclosure: Advancing Open Source While Safeguarding Commercial Sustainability
We are committed to open data and research, and strive to strike a balance between maximal transparency and the protection of legitimate commercial interests.
## License
This dataset is made available under both Apache-2.0 and CC-BY-NC-2.0.
For the avoidance of doubt, use of this dataset requires compliance with both licenses simultaneously. This is not an alternative-license grant, and users may not choose to comply with only one of the two licenses.
Provider: maas
Created: 2026-03-15



