aimlresearch2023/ClimbMix10M

Name: aimlresearch2023/ClimbMix10M
Creator: aimlresearch2023
Published: 2026-03-06 12:54:18
License: No description available

Hugging Face · Updated 2026-03-06 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/aimlresearch2023/ClimbMix10M


Resource overview:

---
dataset_info:
- config_name: cluster_id=1
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 135621816
    num_examples: 86481
- config_name: cluster_id=2
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 229961588
    num_examples: 120810
- config_name: cluster_id=3
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 294157951
    num_examples: 144639
- config_name: cluster_id=4
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 656231983
    num_examples: 385838
- config_name: cluster_id=5
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 330747972
    num_examples: 188878
- config_name: cluster_id=6
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3399526714
    num_examples: 1777803
- config_name: cluster_id=7
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3175368570
    num_examples: 1672850
- config_name: cluster_id=8
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 187566804
    num_examples: 116670
- config_name: cluster_id=9
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 152143870
    num_examples: 81211
- config_name: cluster_id=10
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1375544319
    num_examples: 733824
- config_name: cluster_id=11
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 221867973
    num_examples: 155980
- config_name: cluster_id=12
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3842222865
    num_examples: 2568358
- config_name: cluster_id=13
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 126857494
    num_examples: 90438
- config_name: cluster_id=14
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 41195392
    num_examples: 27670
- config_name: cluster_id=15
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 40711634
    num_examples: 23429
- config_name: cluster_id=16
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1437196013
    num_examples: 728288
- config_name: cluster_id=17
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1330873355
    num_examples: 702212
- config_name: cluster_id=18
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 354273462
    num_examples: 227472
- config_name: cluster_id=19
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 191925300
    num_examples: 116340
- config_name: cluster_id=20
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 98454747
    num_examples: 50809
configs:
- config_name: cluster_id=1
  data_files:
  - split: train
    path: cluster_id=1/train-*
- config_name: cluster_id=2
  data_files:
  - split: train
    path: cluster_id=2/train-*
- config_name: cluster_id=3
  data_files:
  - split: train
    path: cluster_id=3/train-*
- config_name: cluster_id=4
  data_files:
  - split: train
    path: cluster_id=4/train-*
- config_name: cluster_id=5
  data_files:
  - split: train
    path: cluster_id=5/train-*
- config_name: cluster_id=6
  data_files:
  - split: train
    path: cluster_id=6/train-*
- config_name: cluster_id=7
  data_files:
  - split: train
    path: cluster_id=7/train-*
- config_name: cluster_id=8
  data_files:
  - split: train
    path: cluster_id=8/train-*
- config_name: cluster_id=9
  data_files:
  - split: train
    path: cluster_id=9/train-*
- config_name: cluster_id=10
  data_files:
  - split: train
    path: cluster_id=10/train-*
- config_name: cluster_id=11
  data_files:
  - split: train
    path: cluster_id=11/train-*
- config_name: cluster_id=12
  data_files:
  - split: train
    path: cluster_id=12/train-*
- config_name: cluster_id=13
  data_files:
  - split: train
    path: cluster_id=13/train-*
- config_name: cluster_id=14
  data_files:
  - split: train
    path: cluster_id=14/train-*
- config_name: cluster_id=15
  data_files:
  - split: train
    path: cluster_id=15/train-*
- config_name: cluster_id=16
  data_files:
  - split: train
    path: cluster_id=16/train-*
- config_name: cluster_id=17
  data_files:
  - split: train
    path: cluster_id=17/train-*
- config_name: cluster_id=18
  data_files:
  - split: train
    path: cluster_id=18/train-*
- config_name: cluster_id=19
  data_files:
  - split: train
    path: cluster_id=19/train-*
- config_name: cluster_id=20
  data_files:
  - split: train
    path: cluster_id=20/train-*
---

# climbmix10M

## About

Subsampled version of [gvlassis/ClimbMix](https://huggingface.co/datasets/gvlassis/ClimbMix) containing **10,000,000** samples while preserving the original 20-cluster ratio distribution.

## Description

This dataset was created by stream-sampling from `gvlassis/ClimbMix`, so the full 553M rows never had to be downloaded. Per-cluster quotas are allocated with the largest-remainder method, which keeps the ratios across all 20 clusters as close to the source distribution as integer document counts allow.

### Per-cluster quotas

| cluster_id | topics | documents | ratio |
|---|---|---|---|
| 1 | Mathematics, Statistics, Education, Online Tutoring | 86,481 | 0.86% |
| 2 | History, Mathematics, Literature, Religion | 120,810 | 1.21% |
| 3 | Medieval History, Music History, Art and Culture | 144,639 | 1.45% |
| 4 | Education, Wellbeing, Digital Learning, STEM | 385,838 | 3.86% |
| 5 | Career, Education, Finance, Technology | 188,878 | 1.89% |
| 6 | Aluminum, Physics, Biology, AI & Robotics | 1,777,803 | 17.78% |
| 7 | Conservation, Wildlife, Plants, Pets | 1,672,850 | 16.73% |
| 8 | Gaming, Gambling | 116,670 | 1.17% |
| 9 | Astronomy, Space, Astrophysics | 81,211 | 0.81% |
| 10 | Leadership, Health, Education, Safety | 733,824 | 7.34% |
| 11 | Programming, WebDesign | 155,980 | 1.56% |
| 12 | Photography, Technical, Food, Crafts | 2,568,358 | 25.68% |
| 13 | Sports | 90,438 | 0.90% |
| 14 | Music, Composition, Performance | 27,670 | 0.28% |
| 15 | Fantasy, Animation, Fiction | 23,429 | 0.23% |
| 16 | Environment, Energy, Sustainability | 728,288 | 7.28% |
| 17 | Health, Nutrition, Disease, Medicine | 702,212 | 7.02% |
| 18 | Performance, Security, Networking, Privacy | 227,472 | 2.27% |
| 19 | Computers, Relationships, Social Issues, Culture | 116,340 | 1.16% |
| 20 | Women's History, Immigration, Politics, Public Health | 50,809 | 0.51% |

**Total**: 10,000,000 samples

## Usage

```python
import datasets

# Load a specific cluster
dataset = datasets.load_dataset("aimlresearch2023/climbmix10M", "cluster_id=12", split="train")

# Or load all clusters. The card defines 20 configs and no default, so a
# config name is likely required; load each cluster in turn.
dataset_dict = {
    f"cluster_id={i}": datasets.load_dataset("aimlresearch2023/climbmix10M", f"cluster_id={i}", split="train")
    for i in range(1, 21)
}
```

## Source

- **Source dataset**: [gvlassis/ClimbMix](https://huggingface.co/datasets/gvlassis/ClimbMix)
- **Original paper**: [ClimbMix](https://arxiv.org/abs/2504.13161)
- **Sampling method**: Stream-based ratio-preserving subsampling
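
## Quota allocation sketch

The Description mentions that per-cluster quotas come from the largest-remainder method. The following is a minimal sketch of that method, not the script actually used to build this dataset. The function name `largest_remainder_quotas` and the toy cluster sizes are hypothetical and chosen only for illustration; the real source cluster sizes are not listed in this card.

```python
def largest_remainder_quotas(sizes, total):
    """Split `total` samples across clusters in proportion to `sizes`.

    `sizes` maps a cluster key to its document count in the source.
    Integer arithmetic is used throughout to avoid floating-point drift.
    """
    population = sum(sizes.values())
    # Floor of each cluster's proportional share.
    quotas = {k: n * total // population for k, n in sizes.items()}
    # Fractional part of each share, scaled by `population`.
    remainders = {k: n * total % population for k, n in sizes.items()}
    # Samples still unassigned after flooring.
    leftover = total - sum(quotas.values())
    # Give one extra sample to the clusters with the largest remainders.
    ranked = sorted(sizes, key=lambda k: remainders[k], reverse=True)
    for k in ranked[:leftover]:
        quotas[k] += 1
    return quotas


# Toy example with made-up cluster sizes (not the real ClimbMix counts).
toy_sizes = {"a": 47, "b": 33, "c": 20}
print(largest_remainder_quotas(toy_sizes, 10))  # {'a': 5, 'b': 3, 'c': 2}
```

The floors here are 4, 3 and 2, which leaves one sample unassigned; it goes to cluster `a`, whose share of 4.7 has the largest fractional part. The quotas always sum to the requested total, and each differs from its ideal proportional share by less than one sample.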

Provider:

aimlresearch2023
