Korea-MES/Tulu3-Multiturn-MLT
Hugging Face | Updated 2025-11-30 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/Korea-MES/Tulu3-Multiturn-MLT
Description:
# Tulu-3 SFT Mixture Multi-turn MLT Dataset
## Overview
- **Total Samples**: 743,472
- **Train Samples**: 741,463 (multi-turn)
- **Test Samples**: 2,000 (single-turn; only the last turn is extracted)
- **Total Assistant Messages**: 781,924
- **MLT Labels**: 10
## Data Format (same schema for Train & Test)
```json
{
  "id": "sample_id",
  "messages": [
    {"role": "user", "content": "question"},
    {"role": "assistant", "content": "answer"}
  ],
  "mlt": ["[MLT:50]"],
  "source": "tulu-3-sft-mixture"
}
```
- **Train**: multi-turn (`messages` holds several turns, and `mlt` holds several labels)
- **Test**: single-turn (only the last user-assistant pair, with one `mlt` label)
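
The schema above can be sanity-checked with a few lines of Python. This is a minimal sketch that assumes `mlt` carries exactly one label per assistant message, in the same order as the assistant turns; the helper names `count_assistant_turns` and `is_aligned` and the hand-written sample are illustrative and not part of the dataset.

```python
def count_assistant_turns(example):
    """Number of assistant messages in one sample."""
    return sum(1 for m in example["messages"] if m["role"] == "assistant")


def is_aligned(example):
    """True when there is exactly one MLT label per assistant message.

    Assumption: labels are stored in the same order as the assistant turns.
    """
    return len(example["mlt"]) == count_assistant_turns(example)


# Hypothetical two-turn sample, written by hand for illustration.
sample = {
    "id": "demo-0",
    "messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
        {"role": "user", "content": "Explain recursion briefly."},
        {"role": "assistant", "content": "A function that calls itself."},
    ],
    "mlt": ["[MLT:5]", "[MLT:10]"],
    "source": "tulu-3-sft-mixture",
}

print(count_assistant_turns(sample))  # 2
print(is_aligned(sample))             # True
```

Running such a check before training makes a label/turn mismatch fail early instead of surfacing as an index error later.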
## Turn Distribution (Train)
| Turns | Samples | Percentage |
|-------|---------|------------|
| 1 | 723,105 | 97.3% |
| 2 | 12,212 | 1.6% |
| 3 | 4,681 | 0.6% |
| 4 | 1,586 | 0.2% |
| 5 | 693 | 0.1% |
| 6 | 399 | 0.1% |
| 7 | 233 | 0.0% |
| 8 | 147 | 0.0% |
| 9 | 100 | 0.0% |
| 10 | 76 | 0.0% |
| 11 | 39 | 0.0% |
| 12 | 51 | 0.0% |
| 13 | 22 | 0.0% |
| 14 | 28 | 0.0% |
| 15 | 24 | 0.0% |
| 16 | 12 | 0.0% |
| 17 | 8 | 0.0% |
| 18 | 8 | 0.0% |
| 19 | 8 | 0.0% |
| 20 | 10 | 0.0% |
| 21 | 9 | 0.0% |
| 22 | 5 | 0.0% |
| 23 | 2 | 0.0% |
| 24 | 1 | 0.0% |
| 25 | 2 | 0.0% |
| 26 | 2 | 0.0% |
| 27 | 1 | 0.0% |
| 28 | 1 | 0.0% |
| 29 | 1 | 0.0% |
| 31 | 2 | 0.0% |
| 34 | 1 | 0.0% |
| 38 | 1 | 0.0% |
| 43 | 1 | 0.0% |
| 47 | 1 | 0.0% |
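
The table can be cross-checked against the Overview figures. The sketch below assumes that one "turn" corresponds to one assistant message; under that reading, the sample counts in the table add up to 743,472 (the Total Samples figure) and the turn-weighted sum is 781,924 (the Total Assistant Messages figure). `TURN_TABLE` is simply the table transcribed by hand.

```python
# Turn counts transcribed from the table: {turns: samples}.
TURN_TABLE = {
    1: 723_105, 2: 12_212, 3: 4_681, 4: 1_586, 5: 693, 6: 399, 7: 233,
    8: 147, 9: 100, 10: 76, 11: 39, 12: 51, 13: 22, 14: 28, 15: 24,
    16: 12, 17: 8, 18: 8, 19: 8, 20: 10, 21: 9, 22: 5, 23: 2, 24: 1,
    25: 2, 26: 2, 27: 1, 28: 1, 29: 1, 31: 2, 34: 1, 38: 1, 43: 1, 47: 1,
}

# Number of samples covered by the table.
total_samples = sum(TURN_TABLE.values())

# Assumption: each turn contributes exactly one assistant message.
total_assistant_messages = sum(turns * n for turns, n in TURN_TABLE.items())

print(total_samples)             # 743472
print(total_assistant_messages)  # 781924
```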
## MLT Distribution (All Assistant Messages)
| MLT Label | Count | Percentage |
|-----------|-------|------------|
| [MLT:5] | 35,455 | 4.5% |
| [MLT:10] | 18,178 | 2.3% |
| [MLT:30] | 77,415 | 9.9% |
| [MLT:50] | 43,781 | 5.6% |
| [MLT:80] | 67,159 | 8.6% |
| [MLT:150] | 107,910 | 13.8% |
| [MLT:300] | 146,079 | 18.7% |
| [MLT:500] | 156,447 | 20.0% |
| [MLT:700] | 91,539 | 11.7% |
| [MLT:800] | 37,961 | 4.9% |
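
For downstream analysis it can help to turn a label string such as `[MLT:150]` into its integer value. The following is a small sketch; `mlt_value` and `MLT_COUNTS` are illustrative names, and the counts are transcribed from the table. It also recomputes the percentage column from the raw counts.

```python
import re

# Labels look like "[MLT:150]"; capture the integer part.
MLT_PATTERN = re.compile(r"\[MLT:(\d+)\]")


def mlt_value(label):
    """Return the integer inside an MLT label, e.g. '[MLT:150]' -> 150."""
    match = MLT_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"not an MLT label: {label!r}")
    return int(match.group(1))


# Counts transcribed from the table.
MLT_COUNTS = {
    "[MLT:5]": 35_455, "[MLT:10]": 18_178, "[MLT:30]": 77_415,
    "[MLT:50]": 43_781, "[MLT:80]": 67_159, "[MLT:150]": 107_910,
    "[MLT:300]": 146_079, "[MLT:500]": 156_447, "[MLT:700]": 91_539,
    "[MLT:800]": 37_961,
}

total = sum(MLT_COUNTS.values())
for label in sorted(MLT_COUNTS, key=mlt_value):
    share = 100 * MLT_COUNTS[label] / total
    print(f"{label:>10} {MLT_COUNTS[label]:>8,} {share:5.1f}%")
```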
## Test Set MLT Distribution (single-turn, 200 per MLT label)
| MLT Label | Count |
|-----------|-------|
| [MLT:5] | 200 |
| [MLT:10] | 200 |
| [MLT:30] | 200 |
| [MLT:50] | 200 |
| [MLT:80] | 200 |
| [MLT:150] | 200 |
| [MLT:300] | 200 |
| [MLT:500] | 200 |
| [MLT:700] | 200 |
| [MLT:800] | 200 |
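
A balanced split like this one is easy to verify after loading. The sketch below counts the single label on each test example and checks that every label occurs the same number of times; `label_counts` and `is_balanced` are hypothetical helpers, and `toy_test` is a tiny hand-made stand-in so the snippet runs without downloading anything.

```python
from collections import Counter


def label_counts(examples):
    """Count the single MLT label carried by each single-turn example."""
    return Counter(ex["mlt"][0] for ex in examples)


def is_balanced(examples, per_label=200):
    """True when every label that appears occurs exactly `per_label` times."""
    counts = label_counts(examples)
    return bool(counts) and all(n == per_label for n in counts.values())


# Tiny stand-in for the test split: two labels, three examples each.
toy_test = [{"mlt": [lbl]} for lbl in ("[MLT:5]", "[MLT:10]") for _ in range(3)]

print(label_counts(toy_test)["[MLT:5]"])   # 3
print(is_balanced(toy_test, per_label=3))  # True
```

On the real data, passing the loaded test split in place of `toy_test` with the default `per_label=200` should reproduce the table, provided the split matches this card.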
## Source Distribution
| Source | Count | Percentage |
|--------|-------|------------|
| ai2-adapt-dev/evol_codealpaca_heval_decontaminated | 103,761 | 14.0% |
| ai2-adapt-dev/flan_v2_converted | 89,605 | 12.1% |
| ai2-adapt-dev/tulu_v3.9_aya_100k | 88,859 | 12.0% |
| ai2-adapt-dev/tulu_v3.9_wildchat_100k | 61,181 | 8.2% |
| ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k | 49,982 | 6.7% |
| ai2-adapt-dev/tulu_v3.9_open_math_2_gsm8k_50k | 49,938 | 6.7% |
| allenai/tulu-3-sft-personas-math-grade | 49,665 | 6.7% |
| ai2-adapt-dev/tulu_v3.9_synthetic_finalresp_wildguardmixtrain_decontaminated_50k | 49,275 | 6.6% |
| ai2-adapt-dev/numinamath_tir_math_decontaminated | 48,975 | 6.6% |
| ai2-adapt-dev/personahub_math_v5_regen_149960 | 40,162 | 5.4% |
| ai2-adapt-dev/personahub_code_v2_34999 | 34,986 | 4.7% |
| ai2-adapt-dev/personahub_ifdata_manual_seed_v3_29980 | 28,776 | 3.9% |
| ai2-adapt-dev/coconot_converted | 10,981 | 1.5% |
| ai2-adapt-dev/no_robots_converted | 9,453 | 1.3% |
| ai2-adapt-dev/tulu_v3.9_personahub_math_interm_algebra_20k | 8,233 | 1.1% |
| ai2-adapt-dev/tulu_v3.9_sciriff_10k | 7,922 | 1.1% |
| ai2-adapt-dev/oasst1_converted | 6,971 | 0.9% |
| ai2-adapt-dev/tulu_v3.9_table_gpt_5k | 4,507 | 0.6% |
| ai2-adapt-dev/tulu_hard_coded_repeated_10 | 240 | 0.0% |
## Usage
### Train (multi-turn training)
```python
from datasets import load_dataset
ds = load_dataset("Korea-MES/Tulu3-Multiturn-MLT")
train_sample = ds['train'][0]
print(train_sample['messages'])  # multi-turn conversation
print(train_sample['mlt'])  # list of MLT labels
```
### Test (single-turn evaluation)
```python
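from datasets import load_dataset
ds = load_dataset("Korea-MES/Tulu3-Multiturn-MLT")
test_sample = ds['test'][0]
print(test_sample['messages'])  # single turn (last turn only)
print(test_sample['mlt'])  # list of MLT labels (one entry)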
```
## Training Format
Train by prepending the corresponding MLT token to each assistant response:
```python
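def format_multiturn(example):
    """Prefix each assistant message with its MLT label."""
    messages = example["messages"]
    mlt_list = example["mlt"]
    mlt_idx = 0  # index of the next unused MLT label
    new_messages = []
    for msg in messages:
        if msg["role"] == "assistant":
            # Labels are consumed in order, one per assistant turn.
            new_content = f"{mlt_list[mlt_idx]}{msg['content']}"
            new_messages.append({"role": "assistant", "content": new_content})
            mlt_idx += 1
        else:
            # User (and any other) messages pass through unchanged.
            new_messages.append(msg)
    return {"messages": new_messages}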
```
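
The function above indexes into `mlt` without checking its length, so a sample with fewer labels than assistant turns raises an `IndexError`, and extra labels are silently ignored. A stricter variant might look like the following sketch; `format_multiturn_checked` is a hypothetical name, and the one-label-per-assistant-message rule is an assumption inferred from the data format.

```python
def format_multiturn_checked(example):
    """Like format_multiturn, but fail loudly on a label/turn mismatch."""
    messages = example["messages"]
    labels = example["mlt"]
    n_assistant = sum(1 for m in messages if m["role"] == "assistant")
    if n_assistant != len(labels):
        raise ValueError(
            f"{n_assistant} assistant messages but {len(labels)} MLT labels"
        )

    label_iter = iter(labels)
    new_messages = []
    for msg in messages:
        if msg["role"] == "assistant":
            # Copy the message so the input example is not mutated.
            new_messages.append(
                {**msg, "content": f"{next(label_iter)}{msg['content']}"}
            )
        else:
            new_messages.append(msg)
    return {"messages": new_messages}


# Hypothetical two-turn sample, written by hand for illustration.
demo = {
    "messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
        {"role": "user", "content": "Bye"},
        {"role": "assistant", "content": "Goodbye!"},
    ],
    "mlt": ["[MLT:5]", "[MLT:10]"],
}

formatted = format_multiturn_checked(demo)
print(formatted["messages"][1]["content"])  # [MLT:5]Hello!
print(formatted["messages"][3]["content"])  # [MLT:10]Goodbye!
```

Either function can be applied to a whole split with `ds["train"].map(...)`, which overwrites the `messages` column with the returned value.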
## Source
- **Base Dataset**: allenai/tulu-3-sft-mixture
- **Filtering**: Total tokens ≤ 2048 (measured after `apply_chat_template`)
- **MLT Range**: 1-800 tokens per assistant response
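
The length filter could be reproduced roughly as follows. This is a sketch under stated assumptions, not the authors' pipeline: the card does not say which tokenizer was used, so the tokenizer is a parameter, and `WhitespaceTokenizer` is a stand-in that only mimics the `apply_chat_template` call so the snippet runs without downloading a model. `count_chat_tokens` and `within_budget` are hypothetical helper names.

```python
MAX_TOTAL_TOKENS = 2048  # limit stated in the Filtering bullet


def count_chat_tokens(messages, tokenizer):
    """Token count of a conversation after the chat template is applied."""
    encoded = tokenizer.apply_chat_template(messages, tokenize=True)
    # Defensive: handle a dict-like return value as well as a plain id list.
    if hasattr(encoded, "keys"):
        encoded = encoded["input_ids"]
    return len(encoded)


def within_budget(example, tokenizer, max_tokens=MAX_TOTAL_TOKENS):
    """True when the templated conversation fits in the token budget."""
    return count_chat_tokens(example["messages"], tokenizer) <= max_tokens


class WhitespaceTokenizer:
    """Stand-in tokenizer (assumption): splits a naive template on spaces."""

    def apply_chat_template(self, messages, tokenize=True):
        text = " ".join(f"{m['role']}: {m['content']}" for m in messages)
        return text.split() if tokenize else text


example = {
    "messages": [
        {"role": "user", "content": "Hi there"},
        {"role": "assistant", "content": "Hello"},
    ]
}

tok = WhitespaceTokenizer()
print(count_chat_tokens(example["messages"], tok))  # 5
print(within_budget(example, tok))                  # True
```

With real data, a Hugging Face tokenizer that ships a chat template would take the place of the stand-in, and token counts will differ from tokenizer to tokenizer.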
Generated automatically.
Provider:
Korea-MES



