Korea-MES/Tulu3-Multiturn-MLT

Source: Hugging Face · Updated 2025-11-30 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/Korea-MES/Tulu3-Multiturn-MLT
Resource description:
# Tulu-3 SFT Mixture Multi-turn MLT Dataset

## Overview

- **Total Samples**: 743,472
- **Train Samples**: 741,463 (multi-turn)
- **Test Samples**: 2,000 (single-turn, only the last turn extracted)
- **Total Assistant Messages**: 781,924
- **MLT Labels**: 10

## Data Format (same schema for Train & Test)

```json
{
  "id": "sample_id",
  "messages": [
    {"role": "user", "content": "question"},
    {"role": "assistant", "content": "answer"}
  ],
  "mlt": ["[MLT:50]"],
  "source": "tulu-3-sft-mixture"
}
```

- **Train**: multi-turn (`messages` holds several turns, and `mlt` holds several labels)
- **Test**: single-turn (only the last user-assistant pair, with one `mlt` label)

## Turn Distribution (Train)

| Turns | Samples | Percentage |
|-------|---------|------------|
| 1 | 723,105 | 97.3% |
| 2 | 12,212 | 1.6% |
| 3 | 4,681 | 0.6% |
| 4 | 1,586 | 0.2% |
| 5 | 693 | 0.1% |
| 6 | 399 | 0.1% |
| 7 | 233 | 0.0% |
| 8 | 147 | 0.0% |
| 9 | 100 | 0.0% |
| 10 | 76 | 0.0% |
| 11 | 39 | 0.0% |
| 12 | 51 | 0.0% |
| 13 | 22 | 0.0% |
| 14 | 28 | 0.0% |
| 15 | 24 | 0.0% |
| 16 | 12 | 0.0% |
| 17 | 8 | 0.0% |
| 18 | 8 | 0.0% |
| 19 | 8 | 0.0% |
| 20 | 10 | 0.0% |
| 21 | 9 | 0.0% |
| 22 | 5 | 0.0% |
| 23 | 2 | 0.0% |
| 24 | 1 | 0.0% |
| 25 | 2 | 0.0% |
| 26 | 2 | 0.0% |
| 27 | 1 | 0.0% |
| 28 | 1 | 0.0% |
| 29 | 1 | 0.0% |
| 31 | 2 | 0.0% |
| 34 | 1 | 0.0% |
| 38 | 1 | 0.0% |
| 43 | 1 | 0.0% |
| 47 | 1 | 0.0% |

## MLT Distribution (All Assistant Messages)

| MLT Label | Count | Percentage |
|-----------|-------|------------|
| [MLT:5] | 35,455 | 4.5% |
| [MLT:10] | 18,178 | 2.3% |
| [MLT:30] | 77,415 | 9.9% |
| [MLT:50] | 43,781 | 5.6% |
| [MLT:80] | 67,159 | 8.6% |
| [MLT:150] | 107,910 | 13.8% |
| [MLT:300] | 146,079 | 18.7% |
| [MLT:500] | 156,447 | 20.0% |
| [MLT:700] | 91,539 | 11.7% |
| [MLT:800] | 37,961 | 4.9% |

## Test Set MLT Distribution (single-turn, 200 per MLT label)

| MLT Label | Count |
|-----------|-------|
| [MLT:5] | 200 |
| [MLT:10] | 200 |
| [MLT:30] | 200 |
| [MLT:50] | 200 |
| [MLT:80] | 200 |
| [MLT:150] | 200 |
| [MLT:300] | 200 |
| [MLT:500] | 200 |
| [MLT:700] | 200 |
| [MLT:800] | 200 |

## Source Distribution

| Source | Count | Percentage |
|--------|-------|------------|
| ai2-adapt-dev/evol_codealpaca_heval_decontaminated | 103,761 | 14.0% |
| ai2-adapt-dev/flan_v2_converted | 89,605 | 12.1% |
| ai2-adapt-dev/tulu_v3.9_aya_100k | 88,859 | 12.0% |
| ai2-adapt-dev/tulu_v3.9_wildchat_100k | 61,181 | 8.2% |
| ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k | 49,982 | 6.7% |
| ai2-adapt-dev/tulu_v3.9_open_math_2_gsm8k_50k | 49,938 | 6.7% |
| allenai/tulu-3-sft-personas-math-grade | 49,665 | 6.7% |
| ai2-adapt-dev/tulu_v3.9_synthetic_finalresp_wildguardmixtrain_decontaminated_50k | 49,275 | 6.6% |
| ai2-adapt-dev/numinamath_tir_math_decontaminated | 48,975 | 6.6% |
| ai2-adapt-dev/personahub_math_v5_regen_149960 | 40,162 | 5.4% |
| ai2-adapt-dev/personahub_code_v2_34999 | 34,986 | 4.7% |
| ai2-adapt-dev/personahub_ifdata_manual_seed_v3_29980 | 28,776 | 3.9% |
| ai2-adapt-dev/coconot_converted | 10,981 | 1.5% |
| ai2-adapt-dev/no_robots_converted | 9,453 | 1.3% |
| ai2-adapt-dev/tulu_v3.9_personahub_math_interm_algebra_20k | 8,233 | 1.1% |
| ai2-adapt-dev/tulu_v3.9_sciriff_10k | 7,922 | 1.1% |
| ai2-adapt-dev/oasst1_converted | 6,971 | 0.9% |
| ai2-adapt-dev/tulu_v3.9_table_gpt_5k | 4,507 | 0.6% |
| ai2-adapt-dev/tulu_hard_coded_repeated_10 | 240 | 0.0% |

## Usage

### Train (multi-turn training)

```python
from datasets import load_dataset

ds = load_dataset("Korea-MES/Tulu3-Multiturn-MLT")

train_sample = ds['train'][0]
print(train_sample['messages'])  # multi-turn conversation
print(train_sample['mlt'])       # list of MLT labels, one per assistant message
```

### Test (single-turn evaluation)

```python
# Reuses `ds` loaded in the Train example.
test_sample = ds['test'][0]
print(test_sample['messages'])  # single turn (last turn only)
print(test_sample['mlt'])       # list of MLT labels (one entry)
```

## Training Format

For training, prepend the corresponding MLT token to each assistant response:

```python
def format_multiturn(example):
    """Prefix every assistant message with its MLT label."""
    messages = example["messages"]
    mlt_list = example["mlt"]

    mlt_idx = 0  # index of the next unused label in mlt_list
    new_messages = []
    for msg in messages:
        if msg["role"] == "assistant":
            # Labels are consumed in order, one per assistant message.
            new_content = f"{mlt_list[mlt_idx]}{msg['content']}"
            new_messages.append({"role": "assistant", "content": new_content})
            mlt_idx += 1
        else:
            # User (and any other) messages pass through unchanged.
            new_messages.append(msg)

    return {"messages": new_messages}
```

## Source

- **Base Dataset**: allenai/tulu-3-sft-mixture
- **Filtering**: Total tokens ≤ 2048 (as measured with `apply_chat_template`)
- **MLT Range**: 1-800 tokens per assistant response

Generated automatically.
Provider:
Korea-MES