Korea-MES/Dolci-Multiturn-MLT

Name: Korea-MES/Dolci-Multiturn-MLT
Creator: Korea-MES
Published: 2025-11-30 20:21:26
License: No description available

Hugging Face | Updated 2025-11-30 | Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/Korea-MES/Dolci-Multiturn-MLT


Resource description:

# Dolci-Instruct-SFT Multi-turn MLT Dataset

## Overview

- **Total Samples**: 1,665,239
- **Train Samples**: 1,661,837 (multi-turn)
- **Test Samples**: 2,000 (single-turn; only the last turn is extracted)
- **Total Assistant Messages**: 1,668,023
- **MLT Labels**: 10

## Data Format (same schema for Train and Test)

```json
{
  "id": "sample_id",
  "messages": [
    {"role": "user", "content": "question"},
    {"role": "assistant", "content": "answer"}
  ],
  "mlt": ["[MLT:50]"],
  "source": "dolci-instruct-sft"
}
```

- **Train**: multi-turn (`messages` holds several turns, and `mlt` holds several labels)
- **Test**: single-turn (only the last user-assistant pair, with one `mlt` label)

## Turn Distribution (Train)

| Turns | Samples | Percentage |
|-------|---------|------------|
| 1 | 1,662,788 | 99.9% |
| 2 | 2,238 | 0.1% |
| 3 | 183 | 0.0% |
| 4 | 3 | 0.0% |
| 5 | 3 | 0.0% |
| 6 | 7 | 0.0% |
| 7 | 9 | 0.0% |
| 8 | 3 | 0.0% |
| 9 | 2 | 0.0% |
| 10 | 2 | 0.0% |
| 16 | 1 | 0.0% |

## MLT Distribution (All Assistant Messages)

| MLT Label | Count | Percentage |
|-----------|-------|------------|
| [MLT:5] | 95,950 | 5.8% |
| [MLT:10] | 93,801 | 5.6% |
| [MLT:30] | 103,504 | 6.2% |
| [MLT:50] | 83,391 | 5.0% |
| [MLT:80] | 141,501 | 8.5% |
| [MLT:150] | 253,847 | 15.2% |
| [MLT:300] | 337,430 | 20.2% |
| [MLT:500] | 309,736 | 18.6% |
| [MLT:700] | 181,401 | 10.9% |
| [MLT:800] | 67,462 | 4.0% |

## Test Set MLT Distribution (single-turn, 200 per MLT label)

| MLT Label | Count |
|-----------|-------|
| [MLT:5] | 200 |
| [MLT:10] | 200 |
| [MLT:30] | 200 |
| [MLT:50] | 200 |
| [MLT:80] | 200 |
| [MLT:150] | 200 |
| [MLT:300] | 200 |
| [MLT:500] | 200 |
| [MLT:700] | 200 |
| [MLT:800] | 200 |

## Source Distribution

| Source | Count | Percentage |
|--------|-------|------------|
| allenai/olmo-3-instruct-sft-no-tools-final-tagged-topic-ade-keyword-filtered-no-wildchat-reasoning | 1,118,135 | 67.1% |
| allenai/olmo-3-instruct-tagged-wildchat-only-topic-filtered | 222,194 | 13.3% |
| allenai/verifiable-reasoning-v3-o4-mini-length-filtered-verified_tmp_ids-nol-flt-cleaned | 153,934 | 9.2% |
| allenai/verifiable-reasoning-filtered-o4-mini-filtered_tmp_ids-nol-flt-cleaned | 151,108 | 9.1% |
| None | 19,808 | 1.2% |
| allenai/hardcoded-olmo | 60 | 0.0% |

## Usage

### Train (multi-turn training)

```python
from datasets import load_dataset

ds = load_dataset("Korea-MES/Dolci-Multiturn-MLT")

train_sample = ds['train'][0]
print(train_sample['messages'])  # multi-turn conversation
print(train_sample['mlt'])       # list of MLT labels
```

### Test (single-turn evaluation)

```python
from datasets import load_dataset

ds = load_dataset("Korea-MES/Dolci-Multiturn-MLT")
test_sample = ds['test'][0]
print(test_sample['messages'])  # single turn (last turn only)
print(test_sample['mlt'])       # list of MLT labels (one entry)
```

## Training Format

Train with the corresponding MLT token prepended to each assistant response:

```python
def format_multiturn(example):
    """Prefix every assistant message with its MLT label, in order."""
    messages = example["messages"]
    mlt_list = example["mlt"]  # one label per assistant message

    mlt_idx = 0  # index of the next unused label
    new_messages = []
    for msg in messages:
        if msg["role"] == "assistant":
            new_content = f"{mlt_list[mlt_idx]}{msg['content']}"
            new_messages.append({"role": "assistant", "content": new_content})
            mlt_idx += 1
        else:
            # User (and any other) messages pass through unchanged.
            new_messages.append(msg)
    return {"messages": new_messages}
```

## Source

- **Base Dataset**: allenai/Dolci-Instruct-SFT-7B
- **Filtering**: Total tokens ≤ 2048 (measured with `apply_chat_template`)
- **MLT Range**: 1-800 tokens per assistant response

Generated automatically.
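The schema described above implies two properties that are worth checking before training: each assistant message should have exactly one MLT label, and a test-style sample keeps only the last user-assistant pair with its single label. The following is a minimal sketch of those checks on a hand-written sample. The helper names (`parse_mlt`, `check_alignment`, `last_turn`) are hypothetical and not part of the dataset or its tooling, the label format `[MLT:N]` is assumed from the tables above, and `last_turn` is only one reading of "last turn only", not necessarily how the authors built the test split.

```python
import re

# Assumed label format, taken from the README tables: "[MLT:<integer>]".
MLT_PATTERN = re.compile(r"^\[MLT:(\d+)\]$")


def parse_mlt(label):
    """Return the integer inside a label such as "[MLT:50]"."""
    match = MLT_PATTERN.match(label)
    if match is None:
        raise ValueError(f"unexpected MLT label: {label!r}")
    return int(match.group(1))


def check_alignment(example):
    """True when there is exactly one MLT label per assistant message."""
    n_assistant = sum(1 for m in example["messages"] if m["role"] == "assistant")
    return n_assistant == len(example["mlt"])


def last_turn(example):
    """Keep only the final user-assistant pair and the label that belongs to it."""
    messages = example["messages"]
    for i in range(len(messages) - 1, 0, -1):
        if messages[i]["role"] == "assistant" and messages[i - 1]["role"] == "user":
            # The label index equals the number of assistant messages before position i.
            label_idx = sum(1 for m in messages[:i] if m["role"] == "assistant")
            return {
                **example,
                "messages": messages[i - 1 : i + 1],
                "mlt": [example["mlt"][label_idx]],
            }
    raise ValueError("no user-assistant pair found")


# Hand-written sample that follows the documented schema (not real dataset content).
sample = {
    "id": "demo",
    "messages": [
        {"role": "user", "content": "first question"},
        {"role": "assistant", "content": "first answer"},
        {"role": "user", "content": "second question"},
        {"role": "assistant", "content": "second answer"},
    ],
    "mlt": ["[MLT:5]", "[MLT:30]"],
    "source": "dolci-instruct-sft",
}

single = last_turn(sample)
print(check_alignment(sample), single["mlt"], parse_mlt(single["mlt"][0]))
```

Running `check_alignment` over a split before applying `format_multiturn` would surface any sample where the label list and the assistant messages are out of step, which would otherwise raise an `IndexError` or silently leave labels unused.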

Provider:

Korea-MES
