r-three/tulu3-sft-random7-seed123
Source: Hugging Face. Updated 2025-12-09; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/r-three/tulu3-sft-random7-seed123
Description:
# Tülu-3 SFT Mixture — Random 7-Cluster Baseline
This is a **randomly clustered version** of the `allenai/tulu-3-sft-mixture` dataset, where each example is assigned uniformly at random to one of **7 clusters**.
Unlike the original category-based clustering, this split uses **no semantic or source-based logic**. It serves as a **baseline** for evaluating whether principled clustering methods outperform naive random partitioning.
A fixed random seed ensures that the split is **stable and reproducible**.
**Disclaimer**: All examples are preserved exactly as in the original Tülu-3 SFT mixture.
The only changes are an added `random_cluster` field and the partitioning of examples into one split per cluster.
No text, labels, or metadata were changed.
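
The uniform random assignment described above can be sketched in a few lines. This is only an illustration, not the script used to build this repo: the seed value `123` is inferred from the `-seed123` suffix of the repository name, and the helper name `assign_random_clusters` and the choice of Python's `random` module are assumptions.

```python
import random
from collections import Counter

NUM_CLUSTERS = 7
SEED = 123  # assumption: inferred from the "-seed123" suffix of the repo name


def assign_random_clusters(num_examples, num_clusters=NUM_CLUSTERS, seed=SEED):
    """Return one cluster id per example, drawn uniformly at random.

    Hypothetical helper: a dedicated, seeded generator makes the
    assignment identical on every run.
    """
    rng = random.Random(seed)
    return [rng.randrange(num_clusters) for _ in range(num_examples)]


# Toy run: cluster counts come out roughly, but not exactly, equal.
labels = assign_random_clusters(10_000)
print(Counter(labels))
```

Independent draws like these give clusters of nearly equal size. A shuffle-then-chunk scheme would give exactly balanced clusters instead; the card does not say which variant was used.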
---
## Dataset Card
- **Source base dataset:** `allenai/tulu-3-sft-mixture`
- **This repo:** 7-way random clustering with reproducible seed
- **Format:** Hugging Face `datasets` (Arrow/Parquet); conversational examples in `messages` format
---
## Splits / Random Clusters
Each split corresponds to one of seven uniform random clusters:
- `random0`
- `random1`
- `random2`
- `random3`
- `random4`
- `random5`
- `random6`
Each example includes a `random_cluster` field in the range `0–6` corresponding to its assigned split.
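
The relationship between split names and the `random_cluster` field can be checked with a small helper. The sketch below assumes `random_cluster` is stored as an integer and that each example behaves like a dict; the function names are hypothetical and not part of the dataset or of any library.

```python
def split_name_to_cluster(split_name):
    """Map a split name such as 'random3' to the integer 3."""
    prefix = "random"
    if not split_name.startswith(prefix):
        raise ValueError(f"unexpected split name: {split_name!r}")
    return int(split_name[len(prefix):])


def find_mismatches(split_name, examples):
    """Return indices of examples whose random_cluster disagrees with the split."""
    expected = split_name_to_cluster(split_name)
    return [
        i for i, ex in enumerate(examples)
        if ex["random_cluster"] != expected  # assumption: integer-valued field
    ]


# Toy data standing in for rows of the `random3` split.
toy_rows = [{"random_cluster": 3}, {"random_cluster": 3}]
print(find_mismatches("random3", toy_rows))
```

An empty result means every row carries the cluster id that its split name implies.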
---
## Sizes
Each random cluster contains approximately the same number of examples (the figures below are rounded, so they do not sum exactly to the total):
- `random0`: ~134,000 examples
- `random1`: ~134,000 examples
- `random2`: ~134,000 examples
- `random3`: ~134,000 examples
- `random4`: ~134,000 examples
- `random5`: ~134,000 examples
- `random6`: ~134,000 examples
**Total:** 939,343 examples (identical to the original Tülu-3 SFT mixture)
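
As a sanity check on these numbers, the expected size of each cluster follows directly from the total: 939,343 / 7 is roughly 134,192. The short calculation below also estimates how far an individual cluster might deviate from that value. It assumes each example was assigned independently, which the card does not confirm; under that assumption each cluster size is binomially distributed.

```python
import math

TOTAL_EXAMPLES = 939_343  # total stated on this card
NUM_CLUSTERS = 7

# Expected number of examples per cluster under a uniform assignment.
expected_per_cluster = TOTAL_EXAMPLES / NUM_CLUSTERS

# Assumption: independent assignment, so each cluster size is
# Binomial(n, p) with p = 1/7 and std = sqrt(n * p * (1 - p)).
p = 1 / NUM_CLUSTERS
std_per_cluster = math.sqrt(TOTAL_EXAMPLES * p * (1 - p))

print(f"expected per cluster: {expected_per_cluster:,.1f}")
print(f"std per cluster:      {std_per_cluster:,.1f}")
```

Under that assumption the spread is only a few hundred examples, which is consistent with all seven splits rounding to about 134,000.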
---
## Usage
```python
from datasets import load_dataset

# Repository ID as given in this dataset's title and download link
base = "r-three/tulu3-sft-random7-seed123"

# Load a single random cluster (splits are named random0 ... random6)
ds = load_dataset(base, split="random3")

# Inspect an example
ex = ds[0]
print(ex["random_cluster"])
for turn in ex["messages"]:
    print(turn["role"], ":", turn["content"])
```
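
To work with the full mixture rather than a single cluster, the seven splits can be loaded and concatenated. The following is a minimal sketch using `load_dataset` and `concatenate_datasets` from the `datasets` library; the helper name `load_full_mixture` is hypothetical, and the repository ID is passed in as a parameter rather than hard-coded.

```python
SPLITS = [f"random{i}" for i in range(7)]  # random0 ... random6


def load_full_mixture(repo_id):
    """Load all seven random clusters and concatenate them into one dataset.

    Hypothetical helper. The import is done lazily so that defining the
    function needs neither the `datasets` package nor network access.
    """
    from datasets import concatenate_datasets, load_dataset

    parts = [load_dataset(repo_id, split=name) for name in SPLITS]
    return concatenate_datasets(parts)
```

Calling `load_full_mixture` with the repository ID from the download link should, per the total stated on this card, yield 939,343 examples. The row order will follow the cluster order, not the order of the original `allenai/tulu-3-sft-mixture`.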
Provider:
r-three



