
r-three/tulu3-sft-clustered8-merged

Source: Hugging Face | Updated: 2025-12-10 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/r-three/tulu3-sft-clustered8-merged
Description:
# Tülu-3 SFT Mixture: 8-Clustered (Merged) Version

This repository contains 8 specialized domains, automatically extracted and curated from the Tulu-3 SFT mixture (`allenai/tulu-3-sft-mixture`) by clustering. The clustering was performed by `Malikeh1375`; the original version is available at `Malikeh1375/clustered_tulu_3_8`.

For each cluster, the original `train` and `test` splits have been **merged** into a single split, so that training covers the **full cluster data**. Split/cluster names are consistent with the subset names in `Malikeh1375/clustered_tulu_3_8`.

**Disclaimer:** All examples are preserved exactly as in the original clustered dataset. No text content or metadata was modified; only the split assignment (merged `train` + `test`) and the split naming were adjusted.

---

## Dataset Card

- **Source base dataset:** `Malikeh1375/clustered_tulu_3_8`
- **This repo:** merged version with one split per cluster
- **Format:** HuggingFace `datasets` (Arrow/Parquet); conversational examples in `messages` format

---

## Splits / Clusters & Sizes

Each split corresponds to one cluster from the original dataset. Split names are sanitized for Hub compatibility: hyphens are replaced with underscores.

| Split name | Total examples (train + test) |
|---|---|
| `programming_and_code_development` | 110,979 |
| `qanda_and_logical_reasoning` | 94,029 |
| `creative_writing_and_general_tasks` | 118,503 |
| `multilingual_and_translation` | 87,996 |
| `safety_and_harmful_content` | 127,623 |
| `word_problems_and_arithmetic` | 136,114 |
| `non_english_mathematics` | 118,726 |
| `advanced_mathematics_and_modeling` | 145,333 |

**Total (all clusters combined): 939,303** examples.

Each example retains all original fields, e.g. `messages`, `source`, `original_id`, etc.

---

## Usage

```python
from datasets import load_dataset

base = "r-three/tulu3-sft-clustered8-merged"

# Load one cluster (each cluster is exposed as its own split)
ds = load_dataset(base, split="non_english_mathematics")
print(len(ds))  # e.g. 118726
print(ds[0].keys())

# Print every turn of the first conversation
for turn in ds[0]["messages"]:
    print(turn["role"], ":", turn["content"])
```
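
The per-cluster counts in the table can be checked against the stated total, and the same split names can be used to load every cluster at once. The following is a minimal sketch, not part of the original README: the names `SPLIT_SIZES`, `total_examples`, and `load_all_clusters` are hypothetical helpers introduced here for illustration, and concatenation assumes that all eight splits share the same feature schema, which the card does not state explicitly.

```python
# Split sizes copied from the table above (train + test merged).
SPLIT_SIZES = {
    "programming_and_code_development": 110_979,
    "qanda_and_logical_reasoning": 94_029,
    "creative_writing_and_general_tasks": 118_503,
    "multilingual_and_translation": 87_996,
    "safety_and_harmful_content": 127_623,
    "word_problems_and_arithmetic": 136_114,
    "non_english_mathematics": 118_726,
    "advanced_mathematics_and_modeling": 145_333,
}

REPO_ID = "r-three/tulu3-sft-clustered8-merged"


def total_examples(sizes=SPLIT_SIZES):
    """Sum the per-cluster counts listed in the table."""
    return sum(sizes.values())


def load_all_clusters(repo_id=REPO_ID, split_names=tuple(SPLIT_SIZES)):
    """Load every cluster split and concatenate them into one dataset.

    Hypothetical helper: needs the `datasets` package and network access,
    and assumes all splits have identical features. The import is done
    lazily so the size check works without `datasets` installed.
    """
    from datasets import concatenate_datasets, load_dataset

    parts = [load_dataset(repo_id, split=name) for name in split_names]
    return concatenate_datasets(parts)


print(total_examples())  # 939303, matching the stated total
```

Calling `load_all_clusters()` downloads all eight splits, so it is left uncalled here; if the schema assumption holds, `len(load_all_clusters())` should equal the total printed above.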
Provider:
r-three