aimlresearch2023/ClimbMix1M
Hugging Face | Updated 2026-03-06 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/aimlresearch2023/ClimbMix1M
Resource description:
---
dataset_info:
- config_name: cluster_id=1
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 13989357
    num_examples: 8648
- config_name: cluster_id=2
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 23295805
    num_examples: 12081
- config_name: cluster_id=3
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 29071108
    num_examples: 14464
- config_name: cluster_id=4
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 66025241
    num_examples: 38584
- config_name: cluster_id=5
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 32931352
    num_examples: 18888
- config_name: cluster_id=6
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 342399630
    num_examples: 177780
- config_name: cluster_id=7
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 317848347
    num_examples: 167285
- config_name: cluster_id=8
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 19017735
    num_examples: 11667
- config_name: cluster_id=9
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 14530608
    num_examples: 8121
- config_name: cluster_id=10
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 138407157
    num_examples: 73382
- config_name: cluster_id=11
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 21765635
    num_examples: 15598
- config_name: cluster_id=12
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 383527009
    num_examples: 256836
- config_name: cluster_id=13
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 12441679
    num_examples: 9044
- config_name: cluster_id=14
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 4301526
    num_examples: 2767
- config_name: cluster_id=15
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5018255
    num_examples: 2343
- config_name: cluster_id=16
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 144631837
    num_examples: 72829
- config_name: cluster_id=17
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 133892438
    num_examples: 70221
- config_name: cluster_id=18
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 35279430
    num_examples: 22747
- config_name: cluster_id=19
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 19735725
    num_examples: 11634
- config_name: cluster_id=20
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 9562626
    num_examples: 5081
configs:
- config_name: cluster_id=1
  data_files:
  - split: train
    path: cluster_id=1/train-*
- config_name: cluster_id=2
  data_files:
  - split: train
    path: cluster_id=2/train-*
- config_name: cluster_id=3
  data_files:
  - split: train
    path: cluster_id=3/train-*
- config_name: cluster_id=4
  data_files:
  - split: train
    path: cluster_id=4/train-*
- config_name: cluster_id=5
  data_files:
  - split: train
    path: cluster_id=5/train-*
- config_name: cluster_id=6
  data_files:
  - split: train
    path: cluster_id=6/train-*
- config_name: cluster_id=7
  data_files:
  - split: train
    path: cluster_id=7/train-*
- config_name: cluster_id=8
  data_files:
  - split: train
    path: cluster_id=8/train-*
- config_name: cluster_id=9
  data_files:
  - split: train
    path: cluster_id=9/train-*
- config_name: cluster_id=10
  data_files:
  - split: train
    path: cluster_id=10/train-*
- config_name: cluster_id=11
  data_files:
  - split: train
    path: cluster_id=11/train-*
- config_name: cluster_id=12
  data_files:
  - split: train
    path: cluster_id=12/train-*
- config_name: cluster_id=13
  data_files:
  - split: train
    path: cluster_id=13/train-*
- config_name: cluster_id=14
  data_files:
  - split: train
    path: cluster_id=14/train-*
- config_name: cluster_id=15
  data_files:
  - split: train
    path: cluster_id=15/train-*
- config_name: cluster_id=16
  data_files:
  - split: train
    path: cluster_id=16/train-*
- config_name: cluster_id=17
  data_files:
  - split: train
    path: cluster_id=17/train-*
- config_name: cluster_id=18
  data_files:
  - split: train
    path: cluster_id=18/train-*
- config_name: cluster_id=19
  data_files:
  - split: train
    path: cluster_id=19/train-*
- config_name: cluster_id=20
  data_files:
  - split: train
    path: cluster_id=20/train-*
---
# ClimbMix1M
## About
A subsampled version of [gvlassis/ClimbMix](https://huggingface.co/datasets/gvlassis/ClimbMix) containing **1,000,000** samples that preserves the original ratio distribution across the 20 clusters.
## Description
This dataset was created by stream-sampling from `gvlassis/ClimbMix`, without downloading the full 553M rows. Per-cluster quotas were allocated with the largest-remainder method, so each cluster's share matches its share in the source as closely as integer counts allow, and the quotas sum to exactly 1,000,000.
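As an illustration of the largest-remainder method mentioned above, here is a minimal sketch. The function name `largest_remainder` and the toy cluster sizes are assumptions made for this example; they are not taken from the script that built the dataset, and the real ClimbMix cluster sizes are not reproduced here.

```python
def largest_remainder(weights, total):
    """Split `total` into integer quotas proportional to integer `weights`."""
    weight_sum = sum(weights)
    # divmod gives the floored share and the exact integer remainder of each entry.
    parts = [divmod(w * total, weight_sum) for w in weights]
    quotas = [q for q, _ in parts]
    leftover = total - sum(quotas)
    # Hand the leftover units to the entries with the largest remainders.
    order = sorted(range(len(weights)), key=lambda i: parts[i][1], reverse=True)
    for i in order[:leftover]:
        quotas[i] += 1
    return quotas

# Toy example with made-up cluster sizes (hypothetical, not ClimbMix counts).
toy_sizes = [5, 3, 2, 1]
toy_quotas = largest_remainder(toy_sizes, 10)
```

Because the floored shares are topped up one unit at a time, the quotas always sum to the requested total, which plain rounding does not guarantee.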
### Per-cluster quotas:
| cluster_id | topics | documents | ratio |
|---|---|---|---|
| 1 | Mathematics, Statistics, Education, Online Tutoring | 8,648 | 0.86% |
| 2 | History, Mathematics, Literature, Religion | 12,081 | 1.21% |
| 3 | Medieval History, Music History, Art and Culture | 14,464 | 1.45% |
| 4 | Education, Wellbeing, Digital Learning, STEM | 38,584 | 3.86% |
| 5 | Career, Education, Finance, Technology | 18,888 | 1.89% |
| 6 | Aluminum, Physics, Biology, AI & Robotics | 177,780 | 17.78% |
| 7 | Conservation, Wildlife, Plants, Pets | 167,285 | 16.73% |
| 8 | Gaming, Gambling | 11,667 | 1.17% |
| 9 | Astronomy, Space, Astrophysics | 8,121 | 0.81% |
| 10 | Leadership, Health, Education, Safety | 73,382 | 7.34% |
| 11 | Programming, WebDesign | 15,598 | 1.56% |
| 12 | Photography, Technical, Food, Crafts | 256,836 | 25.68% |
| 13 | Sports | 9,044 | 0.90% |
| 14 | Music, Composition, Performance | 2,767 | 0.28% |
| 15 | Fantasy, Animation, Fiction | 2,343 | 0.23% |
| 16 | Environment, Energy, Sustainability | 72,829 | 7.28% |
| 17 | Health, Nutrition, Disease, Medicine | 70,221 | 7.02% |
| 18 | Performance, Security, Networking, Privacy | 22,747 | 2.27% |
| 19 | Computers, Relationships, Social Issues, Culture | 11,634 | 1.16% |
| 20 | Women's History, Immigration, Politics, Public Health | 5,081 | 0.51% |
**Total**: 1,000,000 samples
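The table can be cross-checked with a few lines of Python. The sketch below copies the document counts from the table, recomputes the total, and derives each ratio to two decimals; the variable names are illustrative only.

```python
# Document counts copied from the per-cluster table, keyed by cluster_id.
quotas = {
    1: 8648, 2: 12081, 3: 14464, 4: 38584, 5: 18888,
    6: 177780, 7: 167285, 8: 11667, 9: 8121, 10: 73382,
    11: 15598, 12: 256836, 13: 9044, 14: 2767, 15: 2343,
    16: 72829, 17: 70221, 18: 22747, 19: 11634, 20: 5081,
}

total = sum(quotas.values())

# Percentage of the subsample held by each cluster, formatted as in the table.
ratios = {cid: f"{100 * n / total:.2f}%" for cid, n in quotas.items()}
```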
## Usage
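```python
import datasets
# Load a specific cluster
dataset = datasets.load_dataset("aimlresearch2023/ClimbMix1M", "cluster_id=12", split="train")
# Or load all clusters. Each cluster is a separate config and no default config is
# declared, so calling load_dataset without a config name is expected to raise an error.
config_names = datasets.get_dataset_config_names("aimlresearch2023/ClimbMix1M")
dataset_dict = {name: datasets.load_dataset("aimlresearch2023/ClimbMix1M", name, split="train") for name in config_names}
```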
## Source
- **Source dataset**: [gvlassis/ClimbMix](https://huggingface.co/datasets/gvlassis/ClimbMix)
- **Original paper**: [ClimbMix](https://arxiv.org/abs/2504.13161)
- **Sampling method**: Stream-based ratio-preserving subsampling
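The card does not spell out how the stream is consumed, so the following is only a sketch under one assumption: each cluster is read lazily and reading stops once that cluster's quota is filled. The helpers `take_quota`, `subsample`, and `fake_stream` are hypothetical names, and the local generator stands in for a remote stream so the example runs offline.

```python
from itertools import islice

def take_quota(stream, quota):
    """Return the first `quota` items of an iterable without reading further."""
    return list(islice(stream, quota))

def subsample(streams, quotas):
    """Fill each cluster's quota from its own lazy stream."""
    return {name: take_quota(streams[name], quotas[name]) for name in quotas}

# Local stand-in for remote streams; it counts how many rows were actually read.
consumed = {"a": 0, "b": 0}

def fake_stream(name, size):
    for i in range(size):
        consumed[name] += 1
        yield {"text": f"{name}-{i}"}

streams = {"a": fake_stream("a", 1000), "b": fake_stream("b", 1000)}
sample = subsample(streams, {"a": 3, "b": 1})
```

With the Hugging Face `datasets` library, an iterable returned by `load_dataset(..., streaming=True)` could be passed in place of `fake_stream`. Whether the source repository exposes one config per cluster, and whether the original sampling took a prefix or used a shuffle buffer, are assumptions that the card does not confirm.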
Provider:
aimlresearch2023



