TIDE-dllm/distill_wedlm_sft

Name: TIDE-dllm/distill_wedlm_sft
Creator: TIDE-dllm
Published: 2026-04-30 02:40:36
License: No description provided

Source: Hugging Face. Updated: 2026-04-30. Indexed: 2026-05-03.

Download link:

https://hf-mirror.com/datasets/TIDE-dllm/distill_wedlm_sft


Description:

The distill_wedlm_sft dataset is a pre-tokenized SFT corpus used to train every checkpoint in the Shared-Tokenizer pipeline (Pipeline B) of the TIDE framework, i.e., the distill-WeDLM-* student checkpoints distilled from tencent/WeDLM-8B-Instruct. The dataset ships as a datasets.DatasetDict so that training jobs do not have to re-tokenize at startup; re-tokenizing at job start would cause NCCL timeouts on multi-node runs. The corpus is composed of multiple source datasets, including allenai/tulu-3-sft-mixture and HuggingFaceTB/smoltalk, among others. Its schema includes columns such as input_ids, labels, and teacher_input_ids, among others, which carry the alignment between the student and teacher tokenizations. Usage consists of loading the dataset and invoking the training scripts. The dataset was built with a preprocessing script, is licensed under Apache-2.0, and includes citation information.
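The loading step described above might look roughly like the following minimal sketch. It assumes the repository can be read directly with `datasets.load_dataset`; if it was instead written with `save_to_disk`, a different loading path would be needed. The helper names `missing_columns` and `inspect_remote` are hypothetical and not part of TIDE, and only the three column names mentioned in the description are checked, since the full schema and split names are not listed here.

```python
# Columns named in the dataset description; the real schema contains more.
EXPECTED_COLUMNS = ("input_ids", "labels", "teacher_input_ids")


def missing_columns(column_names, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `column_names`."""
    present = set(column_names)
    return [name for name in expected if name not in present]


def inspect_remote(repo_id="TIDE-dllm/distill_wedlm_sft"):
    """Load the dataset and report missing columns per split.

    Assumption: the repo is loadable via `load_dataset`. Requires the
    `datasets` package and network access, so the import is kept local.
    """
    from datasets import load_dataset

    dataset_dict = load_dataset(repo_id)
    # DatasetDict.column_names maps each split name to its list of columns.
    return {
        split: missing_columns(columns)
        for split, columns in dataset_dict.column_names.items()
    }


# Example (not run here because it downloads data):
# report = inspect_remote()
```

Because the corpus is already tokenized, a check like this only reads column metadata and does not invoke any tokenizer.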

Provider:

TIDE-dllm
