
mansaripo/ClimbMix_shuffled

Hugging Face | Updated 2026-02-23 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/mansaripo/ClimbMix_shuffled
Description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 955259356361
    num_examples: 553315056
  download_size: 955259356361
  dataset_size: 955259356361
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# ClimbMix Shuffled

A globally shuffled version of [gvlassis/ClimbMix](https://huggingface.co/datasets/gvlassis/ClimbMix) (553M rows).

| Cluster | Topics | Ratio | Rows |
|---------|--------|-------|------|
| 1 | Mathematics, Statistics, Education | 0.86% | 4,785,103 |
| 2 | History, Mathematics, Literature, Religion | 1.21% | 6,684,586 |
| 3 | Medieval History, Music History, Art and Culture | 1.45% | 8,003,099 |
| 4 | Education, Wellbeing, Digital Learning, STEM | 3.86% | 21,348,980 |
| 5 | Career, Education, Finance, Technology | 1.89% | 10,450,928 |
| 6 | Aluminum, Physics, Biology, AI & Robotics | 17.78% | 98,368,523 |
| 7 | Conservation, Wildlife, Plants, Pets | 16.73% | 92,561,323 |
| 8 | Gaming, Gambling | 1.17% | 6,455,507 |
| 9 | Astronomy, Space, Astrophysics | 0.81% | 4,493,536 |
| 10 | Leadership, Health, Education, Safety | 7.34% | 40,603,579 |
| 11 | Programming, WebDesign | 1.56% | 8,630,635 |
| 12 | Photography, Technical, Food, Crafts | 25.68% | 142,111,098 |
| 13 | Sports | 0.90% | 5,004,064 |
| 14 | Music, Composition, Performance | 0.28% | 1,530,996 |
| 15 | Fantasy, Animation, Fiction | 0.23% | 1,296,383 |
| 16 | Environment, Energy, Sustainability | 7.28% | 40,297,278 |
| 17 | Health, Nutrition, Disease, Medicine | 7.02% | 38,854,459 |
| 18 | Performance, Security, Networking, Privacy | 2.27% | 12,586,375 |
| 19 | Computers, Relationships, Social Issues, Culture | 1.16% | 6,437,288 |
| 20 | Women's History, Immigration, Politics, Public Health | 0.51% | 2,811,316 |
| **Total** | | **100%** | **553,315,056** |
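
The cluster table can be sanity-checked against the card metadata with a few lines of arithmetic. The following is a minimal sketch, not part of the original card: the names `CLUSTERS`, `TOTAL_ROWS`, `TOTAL_BYTES`, `row_sum`, `mismatches`, and `avg_bytes_per_row` are illustrative, and the numbers are copied from the table and the `dataset_info` block. It checks that the per-cluster row counts add up to `num_examples`, that each published ratio matches its row count when rounded to two decimals, and it derives the average `num_bytes` per example.

```python
# Per-cluster figures copied from the card's table: cluster id -> (ratio in %, rows).
CLUSTERS = {
    1: (0.86, 4_785_103),
    2: (1.21, 6_684_586),
    3: (1.45, 8_003_099),
    4: (3.86, 21_348_980),
    5: (1.89, 10_450_928),
    6: (17.78, 98_368_523),
    7: (16.73, 92_561_323),
    8: (1.17, 6_455_507),
    9: (0.81, 4_493_536),
    10: (7.34, 40_603_579),
    11: (1.56, 8_630_635),
    12: (25.68, 142_111_098),
    13: (0.90, 5_004_064),
    14: (0.28, 1_530_996),
    15: (0.23, 1_296_383),
    16: (7.28, 40_297_278),
    17: (7.02, 38_854_459),
    18: (2.27, 12_586_375),
    19: (1.16, 6_437_288),
    20: (0.51, 2_811_316),
}

TOTAL_ROWS = 553_315_056        # num_examples from the dataset_info block
TOTAL_BYTES = 955_259_356_361   # num_bytes from the dataset_info block

# Sum of the per-cluster row counts, to compare against the reported total.
row_sum = sum(rows for _, rows in CLUSTERS.values())

# Cluster ids whose published ratio differs from rows / total at two decimals.
mismatches = [
    cluster_id
    for cluster_id, (ratio, rows) in CLUSTERS.items()
    if round(100 * rows / TOTAL_ROWS, 2) != ratio
]

# Average size of one example according to the reported metadata.
avg_bytes_per_row = TOTAL_BYTES / TOTAL_ROWS

print("rows match total:", row_sum == TOTAL_ROWS)
print("clusters with ratio mismatch:", mismatches)
print(f"average bytes per example: {avg_bytes_per_row:.1f}")
```

Because each ratio is rounded independently, the published percentages need not sum to exactly 100% even when the row counts sum to the total.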

Provider:
mansaripo