TIDE-dllm/distill_wedlm_sft
Source: Hugging Face. Updated: 2026-04-30. Indexed: 2026-05-03.
Download link:
https://hf-mirror.com/datasets/TIDE-dllm/distill_wedlm_sft
Description:
The distill_wedlm_sft dataset is a pre-tokenized SFT corpus used to train every checkpoint in the Shared-Tokenizer pipeline (Pipeline B) of the TIDE framework, i.e., the distill-WeDLM-* student checkpoints distilled from tencent/WeDLM-8B-Instruct. It ships as a datasets.DatasetDict so that training jobs do not re-tokenize at startup, which would cause NCCL timeouts on multi-node runs. The corpus is composed of multiple source datasets, including allenai/tulu-3-sft-mixture and HuggingFaceTB/smoltalk, among others. Its schema includes columns such as input_ids, labels, and teacher_input_ids, which align the student and teacher tokenizations. Usage consists of loading the dataset and invoking the training scripts. The dataset was built with a preprocessing script, is licensed under Apache-2.0, and includes citation information.
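The loading step mentioned in the description might look roughly like the sketch below. It assumes the Hub repository can be read directly with `datasets.load_dataset`; if the repository was instead written with `save_to_disk`, a local copy would need `datasets.load_from_disk`. The helper names (`missing_columns`, `load_corpus`, `check_corpus`) are hypothetical and not part of TIDE, and only the three columns named in the description are checked, since the full schema is not listed here.

```python
from typing import Iterable, List

REPO_ID = "TIDE-dllm/distill_wedlm_sft"

# Only the columns named in the description; the real schema has more.
EXPECTED_COLUMNS = ("input_ids", "labels", "teacher_input_ids")


def missing_columns(column_names: Iterable[str]) -> List[str]:
    """Return the expected columns that are absent from column_names."""
    present = set(column_names)
    return [name for name in EXPECTED_COLUMNS if name not in present]


def load_corpus(repo_id: str = REPO_ID):
    """Load the pre-tokenized corpus from the Hub (needs network access).

    Assumption: the repository is loadable with load_dataset.
    """
    from datasets import load_dataset  # imported lazily; requires `datasets`

    return load_dataset(repo_id)


def check_corpus(dataset_dict) -> None:
    """Raise ValueError if any split lacks one of the expected columns."""
    for split_name, split in dataset_dict.items():
        missing = missing_columns(split.column_names)
        if missing:
            raise ValueError(f"split {split_name!r} lacks columns: {missing}")
```

Calling `check_corpus(load_corpus())` before launching a multi-node job is one way to catch a schema mismatch early. The actual training scripts and their arguments are not specified in this listing.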
Provider:
TIDE-dllm



