Korea-MES/Dolci-Multiturn-MLT
Source: Hugging Face. Updated 2025-11-30; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Korea-MES/Dolci-Multiturn-MLT
Description:
# Dolci-Instruct-SFT Multi-turn MLT Dataset
## Overview
- **Total Samples**: 1,665,239
- **Train Samples**: 1,661,837 (multi-turn)
- **Test Samples**: 2,000 (single-turn; only the last turn is extracted)
- **Total Assistant Messages**: 1,668,023
- **MLT Labels**: 10
## Data Format (Train & Test share the same schema)
```json
{
  "id": "sample_id",
  "messages": [
    {"role": "user", "content": "question"},
    {"role": "assistant", "content": "answer"}
  ],
  "mlt": ["[MLT:50]"],
  "source": "dolci-instruct-sft"
}
```
- **Train**: multi-turn (`messages` holds several turns, and `mlt` holds several labels)
- **Test**: single-turn (only the last user-assistant pair, with one `mlt` label)
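The schema above can be checked per record with a small validator. This is a minimal sketch, not part of the dataset tooling: `validate_example` is a hypothetical helper, the accepted role set (including `system`) is an assumption, and so is the rule that `mlt` carries exactly one label per assistant message, which the card implies but does not state outright.

```python
def validate_example(example):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    messages = example.get("messages", [])
    labels = example.get("mlt", [])

    # Assumption: roles are limited to these three; the card only shows user/assistant.
    allowed_roles = {"system", "user", "assistant"}
    for i, msg in enumerate(messages):
        if msg.get("role") not in allowed_roles:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content is not a string")

    # Assumption: one MLT label per assistant message, in conversation order.
    n_assistant = sum(1 for m in messages if m.get("role") == "assistant")
    if n_assistant != len(labels):
        problems.append(f"{n_assistant} assistant messages but {len(labels)} MLT labels")
    return problems


sample = {
    "id": "sample_id",
    "messages": [
        {"role": "user", "content": "question"},
        {"role": "assistant", "content": "answer"},
    ],
    "mlt": ["[MLT:50]"],
    "source": "dolci-instruct-sft",
}
print(validate_example(sample))  # []
```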
## Turn Distribution (Train)
| Turns | Samples | Percentage |
|-------|---------|------------|
| 1 | 1,662,788 | 99.9% |
| 2 | 2,238 | 0.1% |
| 3 | 183 | 0.0% |
| 4 | 3 | 0.0% |
| 5 | 3 | 0.0% |
| 6 | 7 | 0.0% |
| 7 | 9 | 0.0% |
| 8 | 3 | 0.0% |
| 9 | 2 | 0.0% |
| 10 | 2 | 0.0% |
| 16 | 1 | 0.0% |
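The table can be cross-checked against the overview with plain arithmetic. The sketch below copies the counts from the table; nothing is loaded from the dataset. The rows sum to the overview's Total Samples figure (1,665,239) rather than the Train Samples figure, and weighting each row by its turn count reproduces Total Assistant Messages, which suggests that one "turn" here corresponds to one assistant message.

```python
# Turn count -> number of samples, copied from the table above.
turn_counts = {
    1: 1_662_788, 2: 2_238, 3: 183, 4: 3, 5: 3, 6: 7,
    7: 9, 8: 3, 9: 2, 10: 2, 16: 1,
}

total_samples = sum(turn_counts.values())
total_assistant = sum(turns * n for turns, n in turn_counts.items())
single_turn_share = turn_counts[1] / total_samples

print(total_samples)                # 1665239
print(total_assistant)              # 1668023
print(f"{single_turn_share:.2%}")   # 99.85%
```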
## MLT Distribution (All Assistant Messages)
| MLT Label | Count | Percentage |
|-----------|-------|------------|
| [MLT:5] | 95,950 | 5.8% |
| [MLT:10] | 93,801 | 5.6% |
| [MLT:30] | 103,504 | 6.2% |
| [MLT:50] | 83,391 | 5.0% |
| [MLT:80] | 141,501 | 8.5% |
| [MLT:150] | 253,847 | 15.2% |
| [MLT:300] | 337,430 | 20.2% |
| [MLT:500] | 309,736 | 18.6% |
| [MLT:700] | 181,401 | 10.9% |
| [MLT:800] | 67,462 | 4.0% |
## Test Set MLT Distribution (single-turn, 200 per MLT label)
| MLT Label | Count |
|-----------|-------|
| [MLT:5] | 200 |
| [MLT:10] | 200 |
| [MLT:30] | 200 |
| [MLT:50] | 200 |
| [MLT:80] | 200 |
| [MLT:150] | 200 |
| [MLT:300] | 200 |
| [MLT:500] | 200 |
| [MLT:700] | 200 |
| [MLT:800] | 200 |
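A balanced split like the one tabulated above can be verified by counting labels. The following is a rough sketch using a tiny in-memory stand-in rather than the real test split; `label_counts` is a hypothetical helper name.

```python
from collections import Counter


def label_counts(records):
    """Count MLT labels across records; each record holds a list under 'mlt'."""
    return Counter(label for rec in records for label in rec["mlt"])


# Tiny stand-in for the test split (illustrative only, not real data).
records = [
    {"mlt": ["[MLT:5]"]},
    {"mlt": ["[MLT:5]"]},
    {"mlt": ["[MLT:800]"]},
    {"mlt": ["[MLT:800]"]},
]
counts = label_counts(records)

# The split is balanced when every label occurs the same number of times.
is_balanced = len(set(counts.values())) == 1
print(dict(counts), is_balanced)
```

On the real data, passing `ds["test"]` to `label_counts` should yield 200 for each of the ten labels if the table holds.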
## Source Distribution
| Source | Count | Percentage |
|--------|-------|------------|
| allenai/olmo-3-instruct-sft-no-tools-final-tagged-topic-ade-keyword-filtered-no-wildchat-reasoning | 1,118,135 | 67.1% |
| allenai/olmo-3-instruct-tagged-wildchat-only-topic-filtered | 222,194 | 13.3% |
| allenai/verifiable-reasoning-v3-o4-mini-length-filtered-verified_tmp_ids-nol-flt-cleaned | 153,934 | 9.2% |
| allenai/verifiable-reasoning-filtered-o4-mini-filtered_tmp_ids-nol-flt-cleaned | 151,108 | 9.1% |
| None | 19,808 | 1.2% |
| allenai/hardcoded-olmo | 60 | 0.0% |
## Usage
### Train (multi-turn training)
```python
from datasets import load_dataset
ds = load_dataset("Korea-MES/Dolci-Multiturn-MLT")
train_sample = ds['train'][0]
print(train_sample['messages'])  # multi-turn conversation
print(train_sample['mlt'])       # list of MLT labels
```
### Test (single-turn evaluation)
```python
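from datasets import load_dataset
ds = load_dataset("Korea-MES/Dolci-Multiturn-MLT")
test_sample = ds['test'][0]
print(test_sample['messages'])  # single-turn (last turn only)
print(test_sample['mlt'])       # list of MLT labels (one entry)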
```
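Because almost all samples are single-turn, it can be useful to isolate the multi-turn ones. The helpers below are a sketch under the assumption that a turn corresponds to one assistant message; `count_assistant_turns` and `is_multiturn` are hypothetical names, not part of the dataset.

```python
def count_assistant_turns(example):
    """Number of assistant messages in one record."""
    return sum(1 for m in example["messages"] if m["role"] == "assistant")


def is_multiturn(example):
    """True when the record contains more than one assistant message."""
    return count_assistant_turns(example) > 1


# In-memory records for illustration only.
single = {"messages": [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
]}
multi = {"messages": single["messages"] + [
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2"},
]}
print(is_multiturn(single), is_multiturn(multi))  # False True
```

With the loaded dataset, `ds["train"].filter(is_multiturn)` would keep only the multi-turn records.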
## Training Format
Prepend the corresponding MLT token to each assistant response for training:
```python
def format_multiturn(example):
    """Prefix each assistant message with its MLT label, in order."""
    messages = example["messages"]
    mlt_list = example["mlt"]
    mlt_idx = 0  # index of the next unused label
    new_messages = []
    for msg in messages:
        if msg["role"] == "assistant":
            # Labels are consumed one per assistant message.
            new_content = f"{mlt_list[mlt_idx]}{msg['content']}"
            new_messages.append({"role": "assistant", "content": new_content})
            mlt_idx += 1
        else:
            new_messages.append(msg)
    return {"messages": new_messages}
```
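When inspecting formatted data or model outputs, the reverse operation is handy: separating the leading label from the response text. The sketch below assumes the label always appears at the very start of the string in the form `[MLT:<digits>]`, as in the examples in this card; `split_mlt_prefix` is a hypothetical helper.

```python
import re

# Matches a leading label such as "[MLT:50]".
MLT_PREFIX = re.compile(r"^\[MLT:(\d+)\]")


def split_mlt_prefix(text):
    """Return (label_value, remaining_text); label_value is None if no prefix."""
    match = MLT_PREFIX.match(text)
    if match is None:
        return None, text
    return int(match.group(1)), text[match.end():]


print(split_mlt_prefix("[MLT:50]Hello there."))  # (50, 'Hello there.')
print(split_mlt_prefix("No label here."))        # (None, 'No label here.')
```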
## Source
- **Base Dataset**: allenai/Dolci-Instruct-SFT-7B
- **Filtering**: Total tokens ≤ 2048 (measured with `apply_chat_template`)
- **MLT Range**: 1-800 tokens per assistant response
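The length filter can be pictured as a simple budget check. This is only a sketch: the card does not say which tokenizer was used, so the token counter is passed in as a caller-supplied function (`count_tokens`, a hypothetical parameter), and the whitespace counter below is a stand-in for demonstration, not a real tokenizer.

```python
def within_token_budget(messages, count_tokens, max_tokens=2048):
    """True if the whole conversation fits within max_tokens."""
    return count_tokens(messages) <= max_tokens


def whitespace_count(messages):
    """Stand-in counter: whitespace-separated words, NOT real tokens."""
    return sum(len(m["content"].split()) for m in messages)


short = [
    {"role": "user", "content": "hello there"},
    {"role": "assistant", "content": "hi"},
]
long = [{"role": "user", "content": "word " * 3000}]

print(within_token_budget(short, whitespace_count))  # True
print(within_token_budget(long, whitespace_count))   # False
```

In practice `count_tokens` would wrap a tokenizer's `apply_chat_template` call so that template tokens are included in the total.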
Generated automatically.
Provider:
Korea-MES



