r-three/tulu3-sft-og-clustering
Source: Hugging Face. Updated 2025-12-09; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/r-three/tulu3-sft-og-clustering
Description:
# Tülu-3 SFT Mixture — Original Category Clustering
This is a **clustered version** of the `allenai/tulu-3-sft-mixture` dataset, where each example is assigned to one of the original Tülu-3 paper categories based on its `source` field.
Each source dataset was mapped to exactly one category based on the definitions in the Tülu-3 paper. No heuristic text-based clustering was used.
The goal is to make it easy to train **per-category experts** or do **category-aware sampling** for instruction tuning and model merging experiments.
**Disclaimer**: All examples are preserved exactly as in the original Tülu-3 SFT mixture. The only changes are an added `category` column and the partitioning of examples into one split per category. No text, labels, or metadata were changed.
---
## Dataset Card
- **Source base dataset:** [`allenai/tulu-3-sft-mixture`](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture)
- **This repo:** clustered version with multiple splits
- **Format:** HuggingFace `datasets` (Arrow/Parquet); conversational examples in `messages` format
---
## Splits / Categories
Each split corresponds to a Tülu-3 paper category:
- **`general`**
  - `ai2-adapt-dev/tulu_hard_coded_repeated_10` (Tülu 3 Hardcoded)
  - `ai2-adapt-dev/oasst1_converted` (OpenAssistant Guanaco)
  - `ai2-adapt-dev/no_robots_converted` (No Robots)
  - `ai2-adapt-dev/tulu_v3.9_wildchat_100k` (WildChat GPT-4 subset)
- **`knowledge_recall`**
  - `ai2-adapt-dev/flan_v2_converted` (FLAN v2)
  - `ai2-adapt-dev/tulu_v3.9_sciriff_10k` (SciRIFF)
  - `ai2-adapt-dev/tulu_v3.9_table_gpt_5k` (TableGPT)
- **`math_reasoning`**
  - `ai2-adapt-dev/personahub_math_v5_regen_149960` (Tülu 3 Persona MATH)
  - `ai2-adapt-dev/tulu_v3.9_personahub_math_interm_algebra_20k` (Tülu 3 Persona Algebra)
  - `ai2-adapt-dev/tulu_v3.9_open_math_2_gsm8k_50k` (OpenMathInstruct2 + GSM8K blend)
  - `ai2-adapt-dev/numinamath_tir_math_decontaminated` (NuminaMath-TIR)
  - `allenai/tulu-3-sft-personas-math-grade` (Grade-school math personas)
- **`coding`**
  - `ai2-adapt-dev/personahub_code_v2_34999` (Tülu 3 Persona Python)
  - `ai2-adapt-dev/evol_codealpaca_heval_decontaminated` (Evol CodeAlpaca)
- **`safety_noncompliance`**
  - `ai2-adapt-dev/coconot_converted` (CoCoNot)
  - `ai2-adapt-dev/tulu_v3.9_synthetic_finalresp_wildguardmixtrain_decontaminated_50k` (WildGuardMix)
  - `ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k` (WildJailbreak)
- **`multilingual`**
  - `ai2-adapt-dev/tulu_v3.9_aya_100k` (Aya)
- **`precise_if`**
  - `ai2-adapt-dev/personahub_ifdata_manual_seed_v3_29980` (Tülu 3 Persona IF)
Each example includes a `category` column corresponding to the split.
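The source-to-category assignment listed above can be sketched as a plain lookup table. This is a minimal illustration, not the actual clustering script: the names `CATEGORY_SOURCES`, `SOURCE_TO_CATEGORY`, and `categorize` are hypothetical, and only the source identifiers and category names come from this card.

```python
# Hypothetical reconstruction of the source -> category lookup described above.
# The identifiers are copied from this card; the variable and function names
# are illustrative and not taken from the real clustering script.
CATEGORY_SOURCES = {
    "general": [
        "ai2-adapt-dev/tulu_hard_coded_repeated_10",
        "ai2-adapt-dev/oasst1_converted",
        "ai2-adapt-dev/no_robots_converted",
        "ai2-adapt-dev/tulu_v3.9_wildchat_100k",
    ],
    "knowledge_recall": [
        "ai2-adapt-dev/flan_v2_converted",
        "ai2-adapt-dev/tulu_v3.9_sciriff_10k",
        "ai2-adapt-dev/tulu_v3.9_table_gpt_5k",
    ],
    "math_reasoning": [
        "ai2-adapt-dev/personahub_math_v5_regen_149960",
        "ai2-adapt-dev/tulu_v3.9_personahub_math_interm_algebra_20k",
        "ai2-adapt-dev/tulu_v3.9_open_math_2_gsm8k_50k",
        "ai2-adapt-dev/numinamath_tir_math_decontaminated",
        "allenai/tulu-3-sft-personas-math-grade",
    ],
    "coding": [
        "ai2-adapt-dev/personahub_code_v2_34999",
        "ai2-adapt-dev/evol_codealpaca_heval_decontaminated",
    ],
    "safety_noncompliance": [
        "ai2-adapt-dev/coconot_converted",
        "ai2-adapt-dev/tulu_v3.9_synthetic_finalresp_wildguardmixtrain_decontaminated_50k",
        "ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k",
    ],
    "multilingual": ["ai2-adapt-dev/tulu_v3.9_aya_100k"],
    "precise_if": ["ai2-adapt-dev/personahub_ifdata_manual_seed_v3_29980"],
}

# Invert the table and verify that every source maps to exactly one category.
SOURCE_TO_CATEGORY = {}
for category, sources in CATEGORY_SOURCES.items():
    for source in sources:
        if source in SOURCE_TO_CATEGORY:
            raise ValueError(f"{source} is assigned to more than one category")
        SOURCE_TO_CATEGORY[source] = category


def categorize(example):
    """Return a copy of `example` with a `category` key derived from `source`."""
    return {**example, "category": SOURCE_TO_CATEGORY[example["source"]]}
```

Because the lookup raises `KeyError` on an unknown `source`, a sketch like this would fail loudly rather than silently drop examples if the base mixture gained a new source dataset.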
---
## Sizes
From the clustering script:
- `general`: 116,871 examples
- `knowledge_recall`: 104,982 examples
- `math_reasoning`: 334,252 examples
- `coding`: 142,275 examples
- `safety_noncompliance`: 110,983 examples
- `multilingual`: 100,000 examples
- `precise_if`: 29,980 examples
Total: 939,343 examples (same as original Tülu-3 SFT mixture).
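For category-aware sampling, the split sizes above can be turned into sampling probabilities. The following is a minimal sketch, assuming a temperature-scaled scheme in which each category is drawn with probability proportional to `n ** (1 / temperature)`. The function `sampling_weights` and the `temperature` parameter are hypothetical helpers and not part of this dataset or of the `datasets` library.

```python
# Split sizes as reported in this card.
SIZES = {
    "general": 116_871,
    "knowledge_recall": 104_982,
    "math_reasoning": 334_252,
    "coding": 142_275,
    "safety_noncompliance": 110_983,
    "multilingual": 100_000,
    "precise_if": 29_980,
}


def sampling_weights(sizes, temperature=1.0):
    """Return per-category probabilities proportional to n ** (1 / temperature).

    temperature=1.0 reproduces the natural proportions of the mixture;
    larger values flatten the distribution toward uniform, which gives
    small categories such as `precise_if` more weight.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {name: n ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


if __name__ == "__main__":
    # Sanity check: the per-split sizes add up to the stated total.
    assert sum(SIZES.values()) == 939_343
    for name, p in sampling_weights(SIZES, temperature=2.0).items():
        print(f"{name:22s} {p:.3f}")
```

The resulting probabilities could then be passed, in split order, to `datasets.interleave_datasets` through its `probabilities` argument to mix the per-category splits at the chosen rates.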
---
## Usage
```python
from datasets import load_dataset

base = "r-three/tulu3-sft-og-clustering"

# Load a specific category (each category is exposed as its own split)
math_ds = load_dataset(base, split="math_reasoning")

# Inspect the first example and iterate over its conversation turns
ex = math_ds[0]
print(ex["source"], ex["category"])
for turn in ex["messages"]:
    print(turn["role"], ":", turn["content"])
```
Provider: r-three



