r-three/tulu3-sft-clustered8-merged
Hugging Face · Updated 2025-12-10 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/r-three/tulu3-sft-clustered8-merged
Description:
# Tülu-3 SFT Mixture — 8-Clustered (Merged) Version
This repository contains eight domain-specific clusters extracted automatically from the Tulu-3 SFT mixture (`allenai/tulu-3-sft-mixture`). The clustering was performed by `Malikeh1375`; the original version is available at `Malikeh1375/clustered_tulu_3_8`.
For each cluster, the original `train` and `test` splits have been **merged** into a single split, so that training covers the **full cluster data**.
Split/cluster names are consistent with subset names from `Malikeh1375/clustered_tulu_3_8`.
**Disclaimer:** All examples are preserved exactly as in the original clustered dataset.
No text content or metadata was modified; only the split assignment (`train` and `test` merged into one split) and the split naming were adjusted.
---
## Dataset Card
- **Source base dataset:** `Malikeh1375/clustered_tulu_3_8`
- **This repo:** merged version with one split per cluster
- **Format:** Hugging Face `datasets` (Arrow/Parquet); conversational examples in `messages` format
---
## Splits / Clusters & Sizes
Each split corresponds to one cluster from the original dataset. The split names used here are sanitized for Hub compatibility (hyphens replaced with underscores).
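The renaming rule stated above amounts to a one-line transformation. The following is a minimal sketch, assuming hyphens are the only characters that need replacing; `sanitize_split_name` and the hyphenated input are hypothetical and used only for illustration.

```python
def sanitize_split_name(name: str) -> str:
    """Replace hyphens with underscores, per the rule stated above."""
    return name.replace("-", "_")


# Hypothetical hyphenated input, not necessarily a name from the original repo.
print(sanitize_split_name("non-english-mathematics"))  # non_english_mathematics
```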
| Split name | Total examples (train + test) |
|---|---|
| `programming_and_code_development` | 110,979 |
| `qanda_and_logical_reasoning` | 94,029 |
| `creative_writing_and_general_tasks` | 118,503 |
| `multilingual_and_translation` | 87,996 |
| `safety_and_harmful_content` | 127,623 |
| `word_problems_and_arithmetic` | 136,114 |
| `non_english_mathematics` | 118,726 |
| `advanced_mathematics_and_modeling` | 145,333 |
**Total (all clusters combined): 939,303** examples.
Each example retains all original fields, such as `messages`, `source`, and `original_id`.
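As a quick sanity check, the per-cluster counts from the table can be summed and compared against the stated total. The snippet below only re-uses the numbers listed above; the variable names are illustrative.

```python
# Per-cluster example counts copied from the table above.
cluster_sizes = {
    "programming_and_code_development": 110_979,
    "qanda_and_logical_reasoning": 94_029,
    "creative_writing_and_general_tasks": 118_503,
    "multilingual_and_translation": 87_996,
    "safety_and_harmful_content": 127_623,
    "word_problems_and_arithmetic": 136_114,
    "non_english_mathematics": 118_726,
    "advanced_mathematics_and_modeling": 145_333,
}

total = sum(cluster_sizes.values())
print(total)  # 939303

# Share of the combined data contributed by each cluster, largest first.
for name, size in sorted(cluster_sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {size / total:.1%}")
```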
---
## Usage
```python
from datasets import load_dataset

base = "r-three/tulu3-sft-clustered8-merged"

# Load one cluster (each cluster is exposed as a split)
ds = load_dataset(base, split="non_english_mathematics")
print(len(ds))  # e.g. 118726
print(ds[0].keys())

# Print each turn of the first conversation
for turn in ds[0]["messages"]:
    print(turn["role"], ":", turn["content"])
```
Provider:
r-three



