TIDE-dllm/distill_wedlm_sft

Source: Hugging Face. Updated 2026-04-30. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/TIDE-dllm/distill_wedlm_sft
Description:
The distill_wedlm_sft dataset is a pre-tokenized SFT corpus used to train every checkpoint in the Shared-Tokenizer pipeline (Pipeline B) of the TIDE framework, i.e., the distill-WeDLM-* student checkpoints distilled from tencent/WeDLM-8B-Instruct. The dataset ships as a datasets.DatasetDict so that jobs do not re-tokenize at startup, a step that would cause NCCL timeouts on multi-node runs. It is composed of multiple source datasets, including allenai/tulu-3-sft-mixture and HuggingFaceTB/smoltalk. Its schema includes columns such as input_ids, labels, and teacher_input_ids, which align the student and teacher tokenizations. Usage consists of loading the dataset and invoking the training scripts. The dataset was built with a preprocessing script, is licensed under Apache-2.0, and includes citation information.
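The loading step mentioned in the description could look roughly like the sketch below. It is not taken from the dataset card: the helper names `missing_columns` and `load_and_check` are hypothetical, the column list covers only the three columns named above (the real schema may contain more), and it assumes the repository can be read with `datasets.load_dataset`. If the repository instead stores a `save_to_disk` export, `datasets.load_from_disk` on a local copy may be needed.

```python
# Columns named in the description; the full schema may contain more.
EXPECTED_COLUMNS = ("input_ids", "labels", "teacher_input_ids")


def missing_columns(column_names, expected=EXPECTED_COLUMNS):
    """Return the expected columns absent from column_names, in order."""
    present = set(column_names)
    return [name for name in expected if name not in present]


def load_and_check(repo_id="TIDE-dllm/distill_wedlm_sft"):
    """Load the dataset from the Hub and check every split's columns.

    Assumption: the repo is loadable with load_dataset. Requires the
    `datasets` package and network access.
    """
    from datasets import load_dataset  # imported lazily on purpose

    dataset_dict = load_dataset(repo_id)  # no split given -> DatasetDict
    for split_name, split in dataset_dict.items():
        missing = missing_columns(split.column_names)
        if missing:
            raise ValueError(f"split {split_name!r} lacks columns: {missing}")
    return dataset_dict


if __name__ == "__main__":
    # Offline check of the helper; call load_and_check() for the real data.
    print(missing_columns(["input_ids", "attention_mask"]))
```

Checking the columns before handing the data to a training script gives an early, readable failure instead of an error deep inside a multi-node job.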
Provider:
TIDE-dllm