aimlresearch2023/ClimbMix100K
Source: Hugging Face. Updated 2026-03-06; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/aimlresearch2023/ClimbMix100K
Resource description:
---
dataset_info:
- config_name: cluster_id=1
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1257710
    num_examples: 865
- config_name: cluster_id=2
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 4491344
    num_examples: 1208
- config_name: cluster_id=3
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2287572
    num_examples: 1446
- config_name: cluster_id=4
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3641952
    num_examples: 3858
- config_name: cluster_id=5
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 6401821
    num_examples: 1889
- config_name: cluster_id=6
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 79414326
    num_examples: 17778
- config_name: cluster_id=7
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 61947487
    num_examples: 16729
- config_name: cluster_id=8
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 7140873
    num_examples: 1167
- config_name: cluster_id=9
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2405144
    num_examples: 812
- config_name: cluster_id=10
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 15108942
    num_examples: 7338
- config_name: cluster_id=11
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 4798560
    num_examples: 1560
- config_name: cluster_id=12
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 75588012
    num_examples: 25684
- config_name: cluster_id=13
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 256736
    num_examples: 904
- config_name: cluster_id=14
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 82269
    num_examples: 277
- config_name: cluster_id=15
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1922778
    num_examples: 234
- config_name: cluster_id=16
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 24718502
    num_examples: 7283
- config_name: cluster_id=17
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 7246704
    num_examples: 7022
- config_name: cluster_id=18
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 6249425
    num_examples: 2275
- config_name: cluster_id=19
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3190109
    num_examples: 1163
- config_name: cluster_id=20
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 283972
    num_examples: 508
configs:
- config_name: cluster_id=1
  data_files:
  - split: train
    path: cluster_id=1/train-*
- config_name: cluster_id=2
  data_files:
  - split: train
    path: cluster_id=2/train-*
- config_name: cluster_id=3
  data_files:
  - split: train
    path: cluster_id=3/train-*
- config_name: cluster_id=4
  data_files:
  - split: train
    path: cluster_id=4/train-*
- config_name: cluster_id=5
  data_files:
  - split: train
    path: cluster_id=5/train-*
- config_name: cluster_id=6
  data_files:
  - split: train
    path: cluster_id=6/train-*
- config_name: cluster_id=7
  data_files:
  - split: train
    path: cluster_id=7/train-*
- config_name: cluster_id=8
  data_files:
  - split: train
    path: cluster_id=8/train-*
- config_name: cluster_id=9
  data_files:
  - split: train
    path: cluster_id=9/train-*
- config_name: cluster_id=10
  data_files:
  - split: train
    path: cluster_id=10/train-*
- config_name: cluster_id=11
  data_files:
  - split: train
    path: cluster_id=11/train-*
- config_name: cluster_id=12
  data_files:
  - split: train
    path: cluster_id=12/train-*
- config_name: cluster_id=13
  data_files:
  - split: train
    path: cluster_id=13/train-*
- config_name: cluster_id=14
  data_files:
  - split: train
    path: cluster_id=14/train-*
- config_name: cluster_id=15
  data_files:
  - split: train
    path: cluster_id=15/train-*
- config_name: cluster_id=16
  data_files:
  - split: train
    path: cluster_id=16/train-*
- config_name: cluster_id=17
  data_files:
  - split: train
    path: cluster_id=17/train-*
- config_name: cluster_id=18
  data_files:
  - split: train
    path: cluster_id=18/train-*
- config_name: cluster_id=19
  data_files:
  - split: train
    path: cluster_id=19/train-*
- config_name: cluster_id=20
  data_files:
  - split: train
    path: cluster_id=20/train-*
---
# ClimbMix100K
## About
A subsample of [gvlassis/ClimbMix](https://huggingface.co/datasets/gvlassis/ClimbMix) containing **100,000** samples, with each of the 20 clusters kept at its original share of the data.
## Description
This dataset was created by stream-sampling from `gvlassis/ClimbMix`, so the full 553M rows never had to be downloaded. Per-cluster quotas were set with the largest-remainder method, which keeps each of the 20 clusters at its original ratio as closely as whole-document counts allow.
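As a rough illustration of the largest-remainder method, the sketch below splits a target sample size into whole-number quotas proportional to a set of source counts. The function name `largest_remainder` and the toy counts are assumptions made for this example; they are not the script or the cluster sizes used to build this dataset.

```python
def largest_remainder(counts, total):
    """Split `total` into integer quotas proportional to `counts`.

    Illustrative sketch only: `counts` maps a key to a non-negative
    integer weight (for example, the number of rows in a cluster).
    """
    count_sum = sum(counts.values())
    quotas, remainders = {}, {}
    for key, count in counts.items():
        # Integer division keeps the arithmetic exact:
        # the quotient is the floor quota, the remainder ranks the leftovers.
        quotas[key], remainders[key] = divmod(total * count, count_sum)
    leftover = total - sum(quotas.values())
    # Hand the remaining slots to the keys with the largest remainders.
    for key in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[key] += 1
    return quotas


# Hypothetical source counts, NOT the real ClimbMix cluster sizes.
toy_counts = {"a": 5, "b": 3, "c": 2}
toy_quotas = largest_remainder(toy_counts, 7)
print(toy_quotas)  # {'a': 4, 'b': 2, 'c': 1}
```

Each quota ends up as either the floor or the ceiling of its exact proportional share, and the quotas always sum to the requested total.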
### Per-cluster quotas:
| cluster_id | topics | documents | ratio |
|---|---|---|---|
| 1 | Mathematics, Statistics, Education, Online Tutoring | 865 | 0.86% |
| 2 | History, Mathematics, Literature, Religion | 1,208 | 1.21% |
| 3 | Medieval History, Music History, Art and Culture | 1,446 | 1.45% |
| 4 | Education, Wellbeing, Digital Learning, STEM | 3,858 | 3.86% |
| 5 | Career, Education, Finance, Technology | 1,889 | 1.89% |
| 6 | Aluminum, Physics, Biology, AI & Robotics | 17,778 | 17.78% |
| 7 | Conservation, Wildlife, Plants, Pets | 16,729 | 16.73% |
| 8 | Gaming, Gambling | 1,167 | 1.17% |
| 9 | Astronomy, Space, Astrophysics | 812 | 0.81% |
| 10 | Leadership, Health, Education, Safety | 7,338 | 7.34% |
| 11 | Programming, WebDesign | 1,560 | 1.56% |
| 12 | Photography, Technical, Food, Crafts | 25,684 | 25.68% |
| 13 | Sports | 904 | 0.90% |
| 14 | Music, Composition, Performance | 277 | 0.28% |
| 15 | Fantasy, Animation, Fiction | 234 | 0.23% |
| 16 | Environment, Energy, Sustainability | 7,283 | 7.28% |
| 17 | Health, Nutrition, Disease, Medicine | 7,022 | 7.02% |
| 18 | Performance, Security, Networking, Privacy | 2,275 | 2.27% |
| 19 | Computers, Relationships, Social Issues, Culture | 1,163 | 1.16% |
| 20 | Women's History, Immigration, Politics, Public Health | 508 | 0.51% |
**Total**: 100,000 samples
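
The table can be cross-checked with a few lines of Python. The snippet below copies the `documents` column as listed above, confirms the total, and recomputes each cluster's share; the variable names are chosen for this example only.

```python
# Document counts per cluster_id, copied from the table above.
documents = {
    1: 865, 2: 1208, 3: 1446, 4: 3858, 5: 1889,
    6: 17778, 7: 16729, 8: 1167, 9: 812, 10: 7338,
    11: 1560, 12: 25684, 13: 904, 14: 277, 15: 234,
    16: 7283, 17: 7022, 18: 2275, 19: 1163, 20: 508,
}

total = sum(documents.values())
# Share of each cluster, in percent of the whole sample.
ratios = {cid: 100 * n / total for cid, n in documents.items()}
# Cluster ids ordered from largest to smallest.
by_size = sorted(documents, key=documents.get, reverse=True)
top_three_docs = sum(documents[cid] for cid in by_size[:3])

print(total)           # 100000
print(by_size[:3])     # [12, 6, 7]
print(top_three_docs)  # 60191
```

The three largest clusters (12, 6, and 7) together account for about 60% of the sample, which is worth keeping in mind when training on a single cluster versus the full mix.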
## Usage
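```python
import datasets

# Load a specific cluster
dataset = datasets.load_dataset("aimlresearch2023/ClimbMix100K", "cluster_id=12", split="train")

# Or load all clusters (each cluster is a separate config and must be requested by name)
dataset_dict = {
    name: datasets.load_dataset("aimlresearch2023/ClimbMix100K", name, split="train")
    for name in datasets.get_dataset_config_names("aimlresearch2023/ClimbMix100K")
}
```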
## Source
- **Source dataset**: [gvlassis/ClimbMix](https://huggingface.co/datasets/gvlassis/ClimbMix)
- **Original paper**: [ClimbMix](https://arxiv.org/abs/2504.13161)
- **Sampling method**: Stream-based ratio-preserving subsampling
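
The card does not publish the sampling script, so the following is only a minimal sketch of what stream-based subsampling under fixed quotas could look like. It assumes one lazy stream of documents per cluster and simply takes the first `quota` items from each; whether the actual pipeline took a prefix or shuffled first is not stated. The helper `take_quotas` and the toy generators are hypothetical.

```python
from itertools import count, islice


def take_quotas(streams, quotas):
    """Read only `quotas[cid]` items from each per-cluster stream.

    Hypothetical helper: `streams` maps a cluster id to any iterable
    (for example, a dataset opened in streaming mode), so nothing
    beyond the requested items is ever pulled.
    """
    sample = {}
    for cid, quota in quotas.items():
        sample[cid] = list(islice(streams[cid], quota))
        if len(sample[cid]) < quota:
            raise ValueError(f"stream for cluster {cid} ended before its quota was met")
    return sample


# Toy stand-ins for remote streams: endless generators of fake documents.
toy_streams = {
    1: (f"doc-1-{i}" for i in count()),
    2: (f"doc-2-{i}" for i in count()),
}
toy_sample = take_quotas(toy_streams, {1: 2, 2: 3})
print(toy_sample[1])  # ['doc-1-0', 'doc-1-1']
```

Because `islice` stops as soon as the quota is met, the endless toy generators above terminate cleanly, which mirrors the point of streaming: the source never has to be materialized in full.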
Provider:
aimlresearch2023



