Xuhui/sft_processed_large_split
Source: Hugging Face. Updated 2026-04-28. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/Xuhui/sft_processed_large_split
Resource description:
This is the **train / val / test split** of `Xuhui/sft_processed_large` (titled "sft_processed_large, profile-disjoint split"), the OdysSim midtraining corpus of 21.4M interactions across 63 datasets. The test split is profile-disjoint wherever a dataset's profile space supports it, so evaluation measures whether the model can handle unseen profiles rather than recall the (profile, behavior) mappings it saw in training. The val split is a random sample drawn from the train shards and is used for checkpoint selection and loss monitoring during midtraining. The dataset is in English, is intended for text generation, falls in the 10M to 100M size category, and is released under the apache-2.0 license.
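To make the profile-disjoint idea concrete, here is a minimal sketch of how such a split could be built. It is not the dataset authors' procedure: the field names `profile` and `behavior`, the function `profile_disjoint_split`, the fractions, and the toy records are all assumptions made for illustration.

```python
import random


def profile_disjoint_split(records, test_fraction=0.2, val_fraction=0.1, seed=0):
    """Hold out whole profiles for test; sample val at random from the rest.

    Hypothetical helper: assumes each record is a dict with a "profile" key.
    """
    rng = random.Random(seed)

    # Choose test profiles, not test rows, so no profile spans train and test.
    profiles = sorted({r["profile"] for r in records})
    rng.shuffle(profiles)
    n_test = max(1, round(len(profiles) * test_fraction))
    test_profiles = set(profiles[:n_test])

    test = [r for r in records if r["profile"] in test_profiles]
    rest = [r for r in records if r["profile"] not in test_profiles]

    # Val is a plain random sample of the remaining (train) rows.
    rng.shuffle(rest)
    n_val = round(len(rest) * val_fraction)
    return {"train": rest[n_val:], "val": rest[:n_val], "test": test}


# Toy data (assumed shape): 10 profiles with 5 behaviors each.
records = [
    {"profile": f"persona-{i}", "behavior": f"utterance-{j}"}
    for i in range(10)
    for j in range(5)
]
splits = profile_disjoint_split(records)
```

Because val is sampled from the train rows, its profiles overlap with train by design; only the test split is held out at the profile level.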
Provided by:
Xuhui



