r-three/tulu3-sft-og-clustering

Name: r-three/tulu3-sft-og-clustering
Creator: r-three
Published: 2025-12-09 17:17:46
License: No description available

Source: Hugging Face. Updated 2025-12-09; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/r-three/tulu3-sft-og-clustering


Description:

# Tülu-3 SFT Mixture: Original Category Clustering

This is a **clustered version** of the `allenai/tulu-3-sft-mixture` dataset, where each example is assigned to one of the original Tülu-3 paper categories based on its `source` field.

Each source dataset was mapped to exactly one category based on the definitions in the Tülu-3 paper. No heuristic text-based clustering was used.

The goal is to make it easy to train **per-category experts** or do **category-aware sampling** for instruction tuning and model merging experiments.

**Disclaimer**: All examples are preserved exactly as in the original Tülu-3 SFT mixture. Only an additional `category` column was added and items were partitioned by split. No text, labels, or metadata were changed.

---

## Dataset Card

- **Source base dataset:** [`allenai/tulu-3-sft-mixture`](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture)
- **This repo:** clustered version with multiple splits
- **Format:** HuggingFace `datasets` (Arrow/Parquet); conversational examples in `messages` format

---

## Splits / Categories

Each split corresponds to a Tülu-3 paper category:

- **`general`**
  - `ai2-adapt-dev/tulu_hard_coded_repeated_10` (Tülu 3 Hardcoded)
  - `ai2-adapt-dev/oasst1_converted` (OpenAssistant Guanaco)
  - `ai2-adapt-dev/no_robots_converted` (No Robots)
  - `ai2-adapt-dev/tulu_v3.9_wildchat_100k` (WildChat GPT-4 subset)
- **`knowledge_recall`**
  - `ai2-adapt-dev/flan_v2_converted` (FLAN v2)
  - `ai2-adapt-dev/tulu_v3.9_sciriff_10k` (SciRIFF)
  - `ai2-adapt-dev/tulu_v3.9_table_gpt_5k` (TableGPT)
- **`math_reasoning`**
  - `ai2-adapt-dev/personahub_math_v5_regen_149960` (Tülu 3 Persona MATH)
  - `ai2-adapt-dev/tulu_v3.9_personahub_math_interm_algebra_20k` (Tülu 3 Persona Algebra)
  - `ai2-adapt-dev/tulu_v3.9_open_math_2_gsm8k_50k` (OpenMathInstruct2 + GSM8K blend)
  - `ai2-adapt-dev/numinamath_tir_math_decontaminated` (NuminaMath-TIR)
  - `allenai/tulu-3-sft-personas-math-grade` (Grade-school math personas)
- **`coding`**
  - `ai2-adapt-dev/personahub_code_v2_34999` (Tülu 3 Persona Python)
  - `ai2-adapt-dev/evol_codealpaca_heval_decontaminated` (Evol CodeAlpaca)
- **`safety_noncompliance`**
  - `ai2-adapt-dev/coconot_converted` (CoCoNot)
  - `ai2-adapt-dev/tulu_v3.9_synthetic_finalresp_wildguardmixtrain_decontaminated_50k` (WildGuardMix)
  - `ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k` (WildJailbreak)
- **`multilingual`**
  - `ai2-adapt-dev/tulu_v3.9_aya_100k` (Aya)
- **`precise_if`**
  - `ai2-adapt-dev/personahub_ifdata_manual_seed_v3_29980` (Tülu 3 Persona IF)

Each example includes a `category` column corresponding to the split.

---

## Sizes

From the clustering script:

- `general`: 116,871 examples
- `knowledge_recall`: 104,982 examples
- `math_reasoning`: 334,252 examples
- `coding`: 142,275 examples
- `safety_noncompliance`: 110,983 examples
- `multilingual`: 100,000 examples
- `precise_if`: 29,980 examples

Total: 939,343 examples (same as original Tülu-3 SFT mixture).

---

## Usage

```python
from datasets import load_dataset

base = "r-three/tulu3-sft-og-clustering"

# Load a specific category
math_ds = load_dataset(base, split="math_reasoning")

# Iterate over messages
ex = math_ds[0]
print(ex["source"], ex["category"])
for turn in ex["messages"]:
    print(turn["role"], ":", turn["content"])
```
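The source-to-category mapping and the category-aware sampling goal described in this card can be sketched offline, without downloading the data. The block below is a minimal illustration, not the repository's clustering script: the dictionaries are copied from the split and size lists in the card, while the names `SOURCE_TO_CATEGORY`, `SPLIT_SIZES`, `category_for`, `sampling_weights`, and `sample_categories` are hypothetical, and the temperature-scaled weighting is an assumption about one common way to sample across categories, not something the card prescribes.

```python
import random

# Source -> category mapping, copied from the split list in this card.
SOURCE_TO_CATEGORY = {
    "ai2-adapt-dev/tulu_hard_coded_repeated_10": "general",
    "ai2-adapt-dev/oasst1_converted": "general",
    "ai2-adapt-dev/no_robots_converted": "general",
    "ai2-adapt-dev/tulu_v3.9_wildchat_100k": "general",
    "ai2-adapt-dev/flan_v2_converted": "knowledge_recall",
    "ai2-adapt-dev/tulu_v3.9_sciriff_10k": "knowledge_recall",
    "ai2-adapt-dev/tulu_v3.9_table_gpt_5k": "knowledge_recall",
    "ai2-adapt-dev/personahub_math_v5_regen_149960": "math_reasoning",
    "ai2-adapt-dev/tulu_v3.9_personahub_math_interm_algebra_20k": "math_reasoning",
    "ai2-adapt-dev/tulu_v3.9_open_math_2_gsm8k_50k": "math_reasoning",
    "ai2-adapt-dev/numinamath_tir_math_decontaminated": "math_reasoning",
    "allenai/tulu-3-sft-personas-math-grade": "math_reasoning",
    "ai2-adapt-dev/personahub_code_v2_34999": "coding",
    "ai2-adapt-dev/evol_codealpaca_heval_decontaminated": "coding",
    "ai2-adapt-dev/coconot_converted": "safety_noncompliance",
    "ai2-adapt-dev/tulu_v3.9_synthetic_finalresp_wildguardmixtrain_decontaminated_50k": "safety_noncompliance",
    "ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k": "safety_noncompliance",
    "ai2-adapt-dev/tulu_v3.9_aya_100k": "multilingual",
    "ai2-adapt-dev/personahub_ifdata_manual_seed_v3_29980": "precise_if",
}

# Split sizes as reported in the Sizes section of this card.
SPLIT_SIZES = {
    "general": 116_871,
    "knowledge_recall": 104_982,
    "math_reasoning": 334_252,
    "coding": 142_275,
    "safety_noncompliance": 110_983,
    "multilingual": 100_000,
    "precise_if": 29_980,
}


def category_for(source: str) -> str:
    """Look up the category for a `source` value (raises KeyError if unknown)."""
    return SOURCE_TO_CATEGORY[source]


def sampling_weights(sizes: dict, temperature: float = 1.0) -> dict:
    """Assumed scheme: p_i proportional to n_i ** (1 / temperature).

    temperature=1.0 samples proportionally to split size; larger values
    flatten the distribution toward uniform across categories.
    """
    scaled = {name: n ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


def sample_categories(sizes: dict, k: int, temperature: float = 1.0, seed: int = 0) -> list:
    """Draw k category names according to the temperature-scaled weights."""
    rng = random.Random(seed)
    weights = sampling_weights(sizes, temperature)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


# The per-split sizes should add up to the total stated in the card.
assert sum(SPLIT_SIZES.values()) == 939_343
print(sampling_weights(SPLIT_SIZES, temperature=2.0))
```

If the `source` strings in the loaded rows match the identifiers listed in the card exactly (an assumption worth checking on a few rows), `category_for(ex["source"]) == ex["category"]` gives a quick consistency check. The category names drawn by `sample_categories` can then be used to pick which split to take the next training example from.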

Provider:

r-three
