Korea-MES/Mixtral-Upperbound-V2
Source: Hugging Face. Updated 2025-12-07; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Korea-MES/Mixtral-Upperbound-V2
Description:
# Mixtral-Upperbound Dataset
A dataset for Multi-Length Token (MLT) training.
## Updates
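- **Added MetaMathQA-40K**: all MATH_* and GSM_* types are included.
- **Added GPQA**: GPQA data converted to the lm-eval format.
- **Removed MetaMath**: the existing MetaMath samples were removed to prevent benchmark contamination.
- **Cleaned LMSYS_Chat**: Alpaca prompt templates were removed.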
## Dataset Statistics
### Splits
| Split | Samples |
|-------|---------|
| train | 525,948 |
| test | 2,000 |
### Source Distribution (Train)
| Source | Count |
|--------|-------|
| LMSYS_Chat | 178,700 |
| MMLU | 99,482 |
| UltraFeedback | 59,837 |
| Winogrande | 40,285 |
| HellaSwag | 39,707 |
| Tulu3_IF | 28,652 |
| MetaMathQA_GSM | 24,180 |
| PIQA | 16,058 |
| MetaMathQA_MATH | 15,722 |
| NoRobots | 9,356 |
| GSM8K | 7,422 |
| GPQA | 2,373 |
| ARC_Test_Easy | 2,238 |
| ARC_Train | 1,111 |
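
As a quick consistency check, the per-source counts can be summed and compared with the train split size. The sketch below copies the numbers from the table above; the variable names are illustrative and not part of the dataset.

```python
# Train-split source counts, copied from the table above.
source_counts = {
    "LMSYS_Chat": 178_700,
    "MMLU": 99_482,
    "UltraFeedback": 59_837,
    "Winogrande": 40_285,
    "HellaSwag": 39_707,
    "Tulu3_IF": 28_652,
    "MetaMathQA_GSM": 24_180,
    "PIQA": 16_058,
    "MetaMathQA_MATH": 15_722,
    "NoRobots": 9_356,
    "GSM8K": 7_422,
    "GPQA": 2_373,
    "ARC_Test_Easy": 2_238,
    "ARC_Train": 1_111,
    "LIMA": 825,
}

total = sum(source_counts.values())
print(total)  # 525948, the train split size

# Share of each source in the train split, largest first.
for name, count in sorted(source_counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:<16} {count:>8,} {100 * count / total:5.1f}%")
```

The counts add up to 525,948, and LMSYS_Chat alone accounts for roughly a third of the train split.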
| LIMA | 825 |
### MLT Distribution (Train)
| MLT | Count |
|-----|-------|
| [MLT:5] | 76,966 |
| [MLT:10] | 56,716 |
| [MLT:30] | 71,821 |
| [MLT:50] | 34,321 |
| [MLT:80] | 28,703 |
| [MLT:150] | 57,437 |
| [MLT:300] | 86,280 |
| [MLT:500] | 72,569 |
| [MLT:700] | 32,929 |
| [MLT:800] | 8,206 |
### MLT Distribution (Test)
| MLT | Count |
|-----|-------|
| [MLT:5] | 200 |
| [MLT:10] | 200 |
| [MLT:30] | 200 |
| [MLT:50] | 200 |
| [MLT:80] | 200 |
| [MLT:150] | 200 |
| [MLT:300] | 200 |
| [MLT:500] | 200 |
| [MLT:700] | 200 |
| [MLT:800] | 200 |
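
The tables above label each bucket with a tag of the form `[MLT:N]`. A minimal sketch of parsing such a tag out of a string follows. It assumes the tag appears literally in the text of a sample; this card does not document the field layout, so the sample strings and the helper name `extract_mlt` are hypothetical.

```python
import re
from collections import Counter

# Matches tags such as "[MLT:30]" and captures the numeric value.
MLT_TAG = re.compile(r"\[MLT:(\d+)\]")

def extract_mlt(text):
    """Return the value of the first [MLT:N] tag in text, or None if absent."""
    match = MLT_TAG.search(text)
    return int(match.group(1)) if match else None

# Hypothetical sample strings, used only to illustrate the parsing.
samples = [
    "[MLT:30] Explain what a mutex is.",
    "[MLT:5] Capital of France?",
    "No tag here",
]

# Tally how many samples fall under each tag (None means no tag was found).
counts = Counter(extract_mlt(s) for s in samples)
print(counts)
```

Applied to the real data, the same tally could be compared against the distribution tables above.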
## MLT Bounds
| MLT Value | Token Range |
|-----------|-------------|
| 5 | 1-5 |
| 10 | 6-10 |
| 30 | 20-30 |
| 50 | 40-50 |
| 80 | 60-80 |
| 150 | 130-150 |
| 300 | 200-300 |
| 500 | 400-500 |
| 700 | 500-650 |
| 800 | 651-800 |
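
The bounds table can be read as a lookup from a response's token count to its MLT value. The following sketch encodes the rows as given. Note that the ranges are not contiguous (for example, 11-19 tokens match no row) and that 500 tokens appears in both the 500 and 700 rows; how the dataset authors treat these cases is not stated here, so this sketch simply returns the first matching row, or `None` when no row matches. The function name `mlt_for_length` is illustrative.

```python
# (MLT value, min tokens, max tokens), copied from the table above.
MLT_BOUNDS = [
    (5, 1, 5),
    (10, 6, 10),
    (30, 20, 30),
    (50, 40, 50),
    (80, 60, 80),
    (150, 130, 150),
    (300, 200, 300),
    (500, 400, 500),
    (700, 500, 650),
    (800, 651, 800),
]

def mlt_for_length(n_tokens):
    """Return the MLT value whose range contains n_tokens, or None.

    Rows are checked in table order, so a count that falls in two
    ranges (only 500) resolves to the earlier row.
    """
    for mlt, low, high in MLT_BOUNDS:
        if low <= n_tokens <= high:
            return mlt
    return None

for n in (3, 15, 250, 500, 600, 801):
    print(n, mlt_for_length(n))
```
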
## Usage
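```python
from datasets import load_dataset

# Repository ID taken from this page's title and download link.
ds = load_dataset("Korea-MES/Mixtral-Upperbound-V2")
train_data = ds["train"]  # 525,948 samples
test_data = ds["test"]    # 2,000 samples
```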
Provider:
Korea-MES



