Korea-MES/Mixtral-Upperbound-V5
Source: Hugging Face | Updated: 2025-12-17 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/Korea-MES/Mixtral-Upperbound-V5
Description:
This is a refined version of `Korea-MES/Mixtral-Upperbound-V4` with **tightened MLT (Maximum Length Token) ranges** and a **balanced test set**, intended to give better control over token length during training and evaluation. The original V4 dataset had two main issues: (1) its MLT ranges were too broad, which caused confusion during training, and (2) its test set was imbalanced, with an uneven MLT distribution. V5 addresses both: its MLT ranges are **tight and precise**, derived from actual token counts under the `Qwen/Qwen2.5-0.5B-Instruct` tokenizer, and its test set is **balanced** at 200 samples per MLT, stratified by source. Each MLT label has an explicitly defined token range, and the dataset provides detailed statistics on sample distribution.
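The two ideas in the description (labeling each sample by token count, then drawing a test set with a fixed number of samples per MLT spread across sources) can be sketched roughly as follows. This is only an illustration under stated assumptions: the bounds in `MLT_BOUNDS`, the field names `mlt` and `source`, and the round-robin reading of "stratified by source" are all hypothetical and are not taken from the dataset itself.

```python
import bisect
import random
from collections import defaultdict

# Hypothetical upper bounds; the actual V5 label ranges are not listed here.
MLT_BOUNDS = [50, 100, 200, 400]


def assign_mlt(token_count, bounds=MLT_BOUNDS):
    """Return the smallest bound >= token_count, or None if it exceeds all bounds."""
    i = bisect.bisect_left(bounds, token_count)
    return bounds[i] if i < len(bounds) else None


def balanced_sample(rows, per_mlt, seed=0):
    """Pick up to per_mlt rows for each MLT label, cycling across sources.

    Assumes each row is a dict with (hypothetical) keys "mlt" and "source".
    """
    rng = random.Random(seed)
    by_mlt = defaultdict(lambda: defaultdict(list))
    for row in rows:
        by_mlt[row["mlt"]][row["source"]].append(row)

    picked = []
    for mlt in sorted(by_mlt):
        # One shuffled pool per source, visited in a fixed order.
        pools = [rng.sample(p, len(p)) for _, p in sorted(by_mlt[mlt].items())]
        chosen = []
        while len(chosen) < per_mlt and any(pools):
            for pool in pools:
                if pool and len(chosen) < per_mlt:
                    chosen.append(pool.pop())
        picked.extend(chosen)
    return picked
```

In practice the token count would presumably come from the tokenizer named above, for example `len(tokenizer(text)["input_ids"])` with a tokenizer loaded via `AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")`. Whether special tokens are included in the dataset's counts is not stated. With `per_mlt=200` the sampler would mirror the stated test-set size, provided each label has at least 200 candidates.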
Provider:
Korea-MES



