
aimlresearch2023/ClimbMix1M

Hugging Face · Updated 2026-03-06 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/aimlresearch2023/ClimbMix1M
Resource description:
---
dataset_info:
- config_name: cluster_id=1
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 13989357
    num_examples: 8648
- config_name: cluster_id=2
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 23295805
    num_examples: 12081
- config_name: cluster_id=3
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 29071108
    num_examples: 14464
- config_name: cluster_id=4
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 66025241
    num_examples: 38584
- config_name: cluster_id=5
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 32931352
    num_examples: 18888
- config_name: cluster_id=6
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 342399630
    num_examples: 177780
- config_name: cluster_id=7
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 317848347
    num_examples: 167285
- config_name: cluster_id=8
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 19017735
    num_examples: 11667
- config_name: cluster_id=9
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 14530608
    num_examples: 8121
- config_name: cluster_id=10
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 138407157
    num_examples: 73382
- config_name: cluster_id=11
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 21765635
    num_examples: 15598
- config_name: cluster_id=12
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 383527009
    num_examples: 256836
- config_name: cluster_id=13
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 12441679
    num_examples: 9044
- config_name: cluster_id=14
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 4301526
    num_examples: 2767
- config_name: cluster_id=15
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5018255
    num_examples: 2343
- config_name: cluster_id=16
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 144631837
    num_examples: 72829
- config_name: cluster_id=17
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 133892438
    num_examples: 70221
- config_name: cluster_id=18
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 35279430
    num_examples: 22747
- config_name: cluster_id=19
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 19735725
    num_examples: 11634
- config_name: cluster_id=20
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 9562626
    num_examples: 5081
configs:
- config_name: cluster_id=1
  data_files:
  - split: train
    path: cluster_id=1/train-*
- config_name: cluster_id=2
  data_files:
  - split: train
    path: cluster_id=2/train-*
- config_name: cluster_id=3
  data_files:
  - split: train
    path: cluster_id=3/train-*
- config_name: cluster_id=4
  data_files:
  - split: train
    path: cluster_id=4/train-*
- config_name: cluster_id=5
  data_files:
  - split: train
    path: cluster_id=5/train-*
- config_name: cluster_id=6
  data_files:
  - split: train
    path: cluster_id=6/train-*
- config_name: cluster_id=7
  data_files:
  - split: train
    path: cluster_id=7/train-*
- config_name: cluster_id=8
  data_files:
  - split: train
    path: cluster_id=8/train-*
- config_name: cluster_id=9
  data_files:
  - split: train
    path: cluster_id=9/train-*
- config_name: cluster_id=10
  data_files:
  - split: train
    path: cluster_id=10/train-*
- config_name: cluster_id=11
  data_files:
  - split: train
    path: cluster_id=11/train-*
- config_name: cluster_id=12
  data_files:
  - split: train
    path: cluster_id=12/train-*
- config_name: cluster_id=13
  data_files:
  - split: train
    path: cluster_id=13/train-*
- config_name: cluster_id=14
  data_files:
  - split: train
    path: cluster_id=14/train-*
- config_name: cluster_id=15
  data_files:
  - split: train
    path: cluster_id=15/train-*
- config_name: cluster_id=16
  data_files:
  - split: train
    path: cluster_id=16/train-*
- config_name: cluster_id=17
  data_files:
  - split: train
    path: cluster_id=17/train-*
- config_name: cluster_id=18
  data_files:
  - split: train
    path: cluster_id=18/train-*
- config_name: cluster_id=19
  data_files:
  - split: train
    path: cluster_id=19/train-*
- config_name: cluster_id=20
  data_files:
  - split: train
    path: cluster_id=20/train-*
---

# ClimbMix1M

## About

Subsampled version of [gvlassis/ClimbMix](https://huggingface.co/datasets/gvlassis/ClimbMix) containing **1,000,000** samples while preserving the original 20-cluster ratio distribution.

## Description

This dataset was created by stream-sampling from `gvlassis/ClimbMix`, without downloading the full 553M rows. Per-cluster quotas were computed with the largest-remainder method, so the cluster ratios of the source are preserved and the quotas sum to exactly 1,000,000.

### Per-cluster quotas:

| cluster_id | topics | documents | ratio |
|---|---|---|---|
| 1 | Mathematics, Statistics, Education, Online Tutoring | 8,648 | 0.86% |
| 2 | History, Mathematics, Literature, Religion | 12,081 | 1.21% |
| 3 | Medieval History, Music History, Art and Culture | 14,464 | 1.45% |
| 4 | Education, Wellbeing, Digital Learning, STEM | 38,584 | 3.86% |
| 5 | Career, Education, Finance, Technology | 18,888 | 1.89% |
| 6 | Aluminum, Physics, Biology, AI & Robotics | 177,780 | 17.78% |
| 7 | Conservation, Wildlife, Plants, Pets | 167,285 | 16.73% |
| 8 | Gaming, Gambling | 11,667 | 1.17% |
| 9 | Astronomy, Space, Astrophysics | 8,121 | 0.81% |
| 10 | Leadership, Health, Education, Safety | 73,382 | 7.34% |
| 11 | Programming, WebDesign | 15,598 | 1.56% |
| 12 | Photography, Technical, Food, Crafts | 256,836 | 25.68% |
| 13 | Sports | 9,044 | 0.90% |
| 14 | Music, Composition, Performance | 2,767 | 0.28% |
| 15 | Fantasy, Animation, Fiction | 2,343 | 0.23% |
| 16 | Environment, Energy, Sustainability | 72,829 | 7.28% |
| 17 | Health, Nutrition, Disease, Medicine | 70,221 | 7.02% |
| 18 | Performance, Security, Networking, Privacy | 22,747 | 2.27% |
| 19 | Computers, Relationships, Social Issues, Culture | 11,634 | 1.16% |
| 20 | Women's History, Immigration, Politics, Public Health | 5,081 | 0.51% |

**Total**: 1,000,000 samples

## Usage

```python
import datasets

# Load a specific cluster
dataset = datasets.load_dataset("aimlresearch2023/ClimbMix1M", "cluster_id=12", split="train")

# Or load all clusters: each cluster is its own config, so load them one by one
dataset_dict = {
    name: datasets.load_dataset("aimlresearch2023/ClimbMix1M", name, split="train")
    for name in datasets.get_dataset_config_names("aimlresearch2023/ClimbMix1M")
}
```

## Source

- **Source dataset**: [gvlassis/ClimbMix](https://huggingface.co/datasets/gvlassis/ClimbMix)
- **Original paper**: [ClimbMix](https://arxiv.org/abs/2504.13161)
- **Sampling method**: Stream-based ratio-preserving subsampling
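
## Example: largest-remainder quotas

The Description names the largest-remainder method but does not show it. The sketch below is one plausible way to compute such quotas, not the author's actual sampling script. The function name `largest_remainder_quotas`, the tie-breaking rule, and the 10,000-sample target are assumptions made for illustration. Only the 20 per-cluster document counts are taken from the table above.

```python
from fractions import Fraction
from math import floor


def largest_remainder_quotas(weights, total):
    """Split `total` into integer quotas proportional to `weights` (hypothetical helper)."""
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive number")
    # Exact fractional share of each bucket (Fraction avoids float rounding).
    ideal = [Fraction(w) * total / weight_sum for w in weights]
    # Start every bucket at the floor of its ideal share.
    quotas = [floor(share) for share in ideal]
    leftover = total - sum(quotas)
    # Hand the leftover units to the largest fractional remainders;
    # ties go to the lower index (an arbitrary choice for this sketch).
    by_remainder = sorted(
        range(len(weights)),
        key=lambda i: (ideal[i] - quotas[i], -i),
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        quotas[i] += 1
    return quotas


# Per-cluster document counts from the table above (cluster_id 1..20).
published_quotas = [
    8648, 12081, 14464, 38584, 18888, 177780, 167285, 11667, 8121, 73382,
    15598, 256836, 9044, 2767, 2343, 72829, 70221, 22747, 11634, 5081,
]

# Assumed use case: shrink further to 10,000 samples with the same ratios.
smaller_quotas = largest_remainder_quotas(published_quotas, 10_000)
print(smaller_quotas, sum(smaller_quotas))
```

Rounding each cluster's share independently can leave the total a few samples off target. Flooring first and then distributing the leftover by remainder guarantees that the quotas sum to exactly the requested total.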
Provider:
aimlresearch2023