aimlresearch2023/ClimbMix100K

Source: Hugging Face. Updated: 2026-03-06. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/aimlresearch2023/ClimbMix100K
Resource description:
---
dataset_info:
- config_name: cluster_id=1
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1257710
    num_examples: 865
- config_name: cluster_id=2
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 4491344
    num_examples: 1208
- config_name: cluster_id=3
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2287572
    num_examples: 1446
- config_name: cluster_id=4
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3641952
    num_examples: 3858
- config_name: cluster_id=5
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 6401821
    num_examples: 1889
- config_name: cluster_id=6
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 79414326
    num_examples: 17778
- config_name: cluster_id=7
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 61947487
    num_examples: 16729
- config_name: cluster_id=8
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 7140873
    num_examples: 1167
- config_name: cluster_id=9
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2405144
    num_examples: 812
- config_name: cluster_id=10
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 15108942
    num_examples: 7338
- config_name: cluster_id=11
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 4798560
    num_examples: 1560
- config_name: cluster_id=12
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 75588012
    num_examples: 25684
- config_name: cluster_id=13
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 256736
    num_examples: 904
- config_name: cluster_id=14
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 82269
    num_examples: 277
- config_name: cluster_id=15
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1922778
    num_examples: 234
- config_name: cluster_id=16
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 24718502
    num_examples: 7283
- config_name: cluster_id=17
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 7246704
    num_examples: 7022
- config_name: cluster_id=18
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 6249425
    num_examples: 2275
- config_name: cluster_id=19
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3190109
    num_examples: 1163
- config_name: cluster_id=20
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 283972
    num_examples: 508
configs:
- config_name: cluster_id=1
  data_files:
  - split: train
    path: cluster_id=1/train-*
- config_name: cluster_id=2
  data_files:
  - split: train
    path: cluster_id=2/train-*
- config_name: cluster_id=3
  data_files:
  - split: train
    path: cluster_id=3/train-*
- config_name: cluster_id=4
  data_files:
  - split: train
    path: cluster_id=4/train-*
- config_name: cluster_id=5
  data_files:
  - split: train
    path: cluster_id=5/train-*
- config_name: cluster_id=6
  data_files:
  - split: train
    path: cluster_id=6/train-*
- config_name: cluster_id=7
  data_files:
  - split: train
    path: cluster_id=7/train-*
- config_name: cluster_id=8
  data_files:
  - split: train
    path: cluster_id=8/train-*
- config_name: cluster_id=9
  data_files:
  - split: train
    path: cluster_id=9/train-*
- config_name: cluster_id=10
  data_files:
  - split: train
    path: cluster_id=10/train-*
- config_name: cluster_id=11
  data_files:
  - split: train
    path: cluster_id=11/train-*
- config_name: cluster_id=12
  data_files:
  - split: train
    path: cluster_id=12/train-*
- config_name: cluster_id=13
  data_files:
  - split: train
    path: cluster_id=13/train-*
- config_name: cluster_id=14
  data_files:
  - split: train
    path: cluster_id=14/train-*
- config_name: cluster_id=15
  data_files:
  - split: train
    path: cluster_id=15/train-*
- config_name: cluster_id=16
  data_files:
  - split: train
    path: cluster_id=16/train-*
- config_name: cluster_id=17
  data_files:
  - split: train
    path: cluster_id=17/train-*
- config_name: cluster_id=18
  data_files:
  - split: train
    path: cluster_id=18/train-*
- config_name: cluster_id=19
  data_files:
  - split: train
    path: cluster_id=19/train-*
- config_name: cluster_id=20
  data_files:
  - split: train
    path: cluster_id=20/train-*
---

# ClimbMix100K

## About

Subsampled version of [gvlassis/ClimbMix](https://huggingface.co/datasets/gvlassis/ClimbMix) containing **100,000** samples while preserving the original 20-cluster ratio distribution.

## Description

This dataset was created by stream-sampling from `gvlassis/ClimbMix`, without downloading the full 553M rows. The sampling preserves the original ratio distribution across all 20 clusters, with the per-cluster quotas rounded to whole documents by the largest-remainder method.

### Per-cluster quotas

| cluster_id | topics | documents | ratio |
|---|---|---|---|
| 1 | Mathematics, Statistics, Education, Online Tutoring | 865 | 0.86% |
| 2 | History, Mathematics, Literature, Religion | 1,208 | 1.21% |
| 3 | Medieval History, Music History, Art and Culture | 1,446 | 1.45% |
| 4 | Education, Wellbeing, Digital Learning, STEM | 3,858 | 3.86% |
| 5 | Career, Education, Finance, Technology | 1,889 | 1.89% |
| 6 | Aluminum, Physics, Biology, AI & Robotics | 17,778 | 17.78% |
| 7 | Conservation, Wildlife, Plants, Pets | 16,729 | 16.73% |
| 8 | Gaming, Gambling | 1,167 | 1.17% |
| 9 | Astronomy, Space, Astrophysics | 812 | 0.81% |
| 10 | Leadership, Health, Education, Safety | 7,338 | 7.34% |
| 11 | Programming, WebDesign | 1,560 | 1.56% |
| 12 | Photography, Technical, Food, Crafts | 25,684 | 25.68% |
| 13 | Sports | 904 | 0.90% |
| 14 | Music, Composition, Performance | 277 | 0.28% |
| 15 | Fantasy, Animation, Fiction | 234 | 0.23% |
| 16 | Environment, Energy, Sustainability | 7,283 | 7.28% |
| 17 | Health, Nutrition, Disease, Medicine | 7,022 | 7.02% |
| 18 | Performance, Security, Networking, Privacy | 2,275 | 2.27% |
| 19 | Computers, Relationships, Social Issues, Culture | 1,163 | 1.16% |
| 20 | Women's History, Immigration, Politics, Public Health | 508 | 0.51% |

**Total**: 100,000 samples

## Usage

```python
import datasets

# Load a specific cluster
dataset = datasets.load_dataset("aimlresearch2023/ClimbMix100K", "cluster_id=12", split="train")

# Or load all clusters. No default config is defined, so each config is loaded by name.
names = datasets.get_dataset_config_names("aimlresearch2023/ClimbMix100K")
dataset_dict = {name: datasets.load_dataset("aimlresearch2023/ClimbMix100K", name, split="train") for name in names}
```

## Source

- **Source dataset**: [gvlassis/ClimbMix](https://huggingface.co/datasets/gvlassis/ClimbMix)
- **Original paper**: [ClimbMix](https://arxiv.org/abs/2504.13161)
- **Sampling method**: Stream-based ratio-preserving subsampling
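
The largest-remainder method named in the Description can be illustrated with a minimal sketch. This is not the script used to build the dataset: the function name `largest_remainder_quotas` and the cluster sizes below are hypothetical, chosen only to show how proportional quotas are rounded so that they sum exactly to the target sample count.

```python
def largest_remainder_quotas(sizes, total):
    """Split `total` samples across clusters in proportion to `sizes`.

    `sizes` maps a cluster key to its document count in the source data.
    """
    population = sum(sizes.values())
    # Integer part of each cluster's proportional share.
    quotas = {k: n * total // population for k, n in sizes.items()}
    # Fractional part of each share, kept as an integer numerator.
    remainders = {k: n * total % population for k, n in sizes.items()}
    # Samples still unassigned after flooring every share.
    leftover = total - sum(quotas.values())
    # Give one extra sample to the clusters with the largest remainders.
    for k in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[k] += 1
    return quotas


# Hypothetical cluster sizes, for illustration only (not the real ClimbMix counts).
sizes = {"a": 620, "b": 250, "c": 130}
quotas = largest_remainder_quotas(sizes, 10)
print(quotas)  # {'a': 6, 'b': 3, 'c': 1}
```

In this example the exact shares are 6.2, 2.5 and 1.3. Flooring assigns 9 of the 10 samples, and the remaining one goes to `b`, which has the largest fractional part. Each quota therefore differs from its exact share by less than one sample.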
Provided by:
aimlresearch2023