Korea-MES/Token-Upperbound-V3

Name: Korea-MES/Token-Upperbound-V3
Creator: Korea-MES
Published: 2025-12-14 08:51:13
License: no description provided

Source: Hugging Face. Updated 2025-12-14; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/Korea-MES/Token-Upperbound-V3

Description:

Token-Upperbound-V3 is the third version of the Token Upperbound dataset. It merges high-quality instruction-following data from Hermes2.5 and Mixtral-V2 and provides diverse, balanced instruction-response pairs whose response length is controlled explicitly through Maximum Length Token (MLT) markers. Features include the question, answer, context, MLT tag, token length, and source identifier. MLT tags range from very short responses (≤5-10 tokens) to extended responses (≤800-1024 tokens). The train split contains roughly 680K samples and the test split roughly 1,600-2,000, drawn from sources including Hermes2, MetaMath, and LMSYS_Chat, among others. The README also provides usage instructions, training recommendations, and known issues.
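As a rough illustration of how the features described above might fit together, the following sketch models one record and checks it against its MLT upper bound. The column names (`question`, `answer`, `context`, `mlt`, `token_length`, `source`), the `[MLT:n]` marker format, and the sample values are all assumptions made for this example; the actual schema and tag syntax are defined in the dataset's README and may differ.

```python
from typing import Optional, TypedDict


class Record(TypedDict):
    # Hypothetical column names; the real dataset schema may differ.
    question: str
    answer: str
    context: Optional[str]
    mlt: int           # assumed: upper bound on response length, in tokens
    token_length: int  # assumed: measured length of the answer, in tokens
    source: str


def within_bound(record: Record) -> bool:
    """Return True when the answer's token count respects its MLT bound."""
    return record["token_length"] <= record["mlt"]


def build_prompt(record: Record) -> str:
    """Prefix the question with a length-control marker.

    The "[MLT:n]" format is invented for illustration only.
    """
    parts = [f"[MLT:{record['mlt']}]"]
    if record["context"]:
        parts.append(record["context"])
    parts.append(record["question"])
    return "\n".join(parts)


# Made-up sample record, not taken from the dataset.
sample: Record = {
    "question": "What is 2 + 2?",
    "answer": "4",
    "context": None,
    "mlt": 10,
    "token_length": 1,
    "source": "MetaMath",
}

prompt = build_prompt(sample)
ok = within_bound(sample)
```

A check like `within_bound` could serve as a sanity filter before training, dropping any pair whose answer exceeds its stated bound.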

Provider:

Korea-MES
