r-three/tulu3-sft-clustered8-merged

Name: r-three/tulu3-sft-clustered8-merged
Creator: r-three
Published: 2025-12-10 13:07:08
License: No description available

Source: Hugging Face. Updated 2025-12-10; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/r-three/tulu3-sft-clustered8-merged

Description:

# Tülu-3 SFT Mixture: 8-Clustered (Merged) Version

This repository contains 8 specialized domains, automatically extracted and curated from the Tulu-3 SFT mixture (`allenai/tulu-3-sft-mixture`) using advanced clustering techniques performed by `Malikeh1375` (original version found in `Malikeh1375/clustered_tulu_3_8`).

For each cluster, the original `train` and `test` splits have been **merged** into a single split, so that training covers the **full cluster data**. Split/cluster names are consistent with subset names from `Malikeh1375/clustered_tulu_3_8`.

**Disclaimer:** All examples are preserved exactly as in the original clustered dataset. No text content or metadata was modified; only split assignment (merged `train` + `test`) and split naming were adjusted.

---

## Dataset Card

- **Source base dataset:** `Malikeh1375/clustered_tulu_3_8`
- **This repo:** merged version with one split per cluster
- **Format:** HuggingFace `datasets` (Arrow/Parquet); conversational examples in `messages` format

---

## Splits / Clusters & Sizes

Each split corresponds to one cluster from the original dataset. Split names used here are sanitized (hyphens replaced with underscores) for Hub compatibility.

| Split name | Total examples (train + test) |
|---|---|
| `programming_and_code_development` | 110,979 |
| `qanda_and_logical_reasoning` | 94,029 |
| `creative_writing_and_general_tasks` | 118,503 |
| `multilingual_and_translation` | 87,996 |
| `safety_and_harmful_content` | 127,623 |
| `word_problems_and_arithmetic` | 136,114 |
| `non_english_mathematics` | 118,726 |
| `advanced_mathematics_and_modeling` | 145,333 |

**Total (all clusters combined): 939,303** examples.

Each example retains all original fields: e.g. `messages`, `source`, `original_id`, etc.

---

## Usage

```python
from datasets import load_dataset

base = "r-three/tulu3-sft-clustered8-merged"

# Load one cluster
ds = load_dataset(base, split="non_english_mathematics")
print(len(ds))  # e.g. 118726
print(ds[0].keys())

# Print each turn of the first conversation
for turn in ds[0]["messages"]:
    print(turn["role"], ":", turn["content"])
```
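
To train on more than one cluster at a time, the per-split sizes and the naming rule described in this card can be captured in a small helper. The following is only a sketch under stated assumptions: `SPLIT_SIZES`, `sanitize_split_name`, and `load_clusters` are hypothetical names introduced here, not part of the repository, and `load_clusters` assumes that all clusters share the same schema so that `datasets.concatenate_datasets` can combine them.

```python
# Sizes copied from the split table in this card (train + test merged).
SPLIT_SIZES = {
    "programming_and_code_development": 110_979,
    "qanda_and_logical_reasoning": 94_029,
    "creative_writing_and_general_tasks": 118_503,
    "multilingual_and_translation": 87_996,
    "safety_and_harmful_content": 127_623,
    "word_problems_and_arithmetic": 136_114,
    "non_english_mathematics": 118_726,
    "advanced_mathematics_and_modeling": 145_333,
}


def sanitize_split_name(name: str) -> str:
    """Replace hyphens with underscores, the renaming rule the card describes."""
    return name.replace("-", "_")


def load_clusters(names, repo="r-three/tulu3-sft-clustered8-merged"):
    """Load several clusters and concatenate them into one dataset.

    Assumption: every cluster has identical features; otherwise
    concatenation will fail and the columns must be aligned first.
    """
    unknown = [n for n in names if n not in SPLIT_SIZES]
    if unknown:
        raise ValueError(f"Unknown split name(s): {unknown}")
    # Imported lazily so the size table can be used without `datasets` installed.
    from datasets import concatenate_datasets, load_dataset

    parts = [load_dataset(repo, split=n) for n in names]
    return concatenate_datasets(parts)


# Sanity check: the per-split sizes add up to the total stated in the card.
total = sum(SPLIT_SIZES.values())
print(total)  # 939303
```

For example, `load_clusters(["word_problems_and_arithmetic", "advanced_mathematics_and_modeling"])` would download both math-oriented clusters and return them as a single dataset, provided the schema assumption holds.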

Provided by:

r-three
