mansaripo/ClimbMix_shuffled
Source: Hugging Face | Updated: 2026-02-23 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/mansaripo/ClimbMix_shuffled
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 955259356361
    num_examples: 553315056
  download_size: 955259356361
  dataset_size: 955259356361
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# ClimbMix Shuffled
A globally shuffled version of [gvlassis/ClimbMix](https://huggingface.co/datasets/gvlassis/ClimbMix) (553M rows).
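
Because the metadata lists roughly 955 GB for the single `train` split, streaming is likely more practical than a full download. The following is a minimal sketch, assuming the Hugging Face `datasets` package is installed. The helper names `mean_bytes_per_example` and `peek` are hypothetical and chosen only for illustration; the repository ID, split name, `text` column, and byte and row counts come from the card metadata.

```python
from itertools import islice

REPO_ID = "mansaripo/ClimbMix_shuffled"
NUM_BYTES = 955_259_356_361   # num_bytes from the card metadata
NUM_EXAMPLES = 553_315_056    # num_examples from the card metadata


def mean_bytes_per_example() -> float:
    """Average size of one row, derived only from the card metadata."""
    return NUM_BYTES / NUM_EXAMPLES


def peek(n: int = 3) -> list[str]:
    """Stream the first n rows of the train split without a full download.

    Hypothetical helper: needs network access and `pip install datasets`.
    The import is local so the arithmetic above works without the library.
    """
    from datasets import load_dataset

    stream = load_dataset(REPO_ID, split="train", streaming=True)
    # Each streamed row is a dict with a single "text" field.
    return [row["text"] for row in islice(stream, n)]
```

Calling `peek()` returns the first few documents as strings. Since the rows are described as globally shuffled, a short prefix of the stream should already mix topic clusters, although the card does not state this explicitly.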
| Cluster | Topics | Ratio | Rows |
|---------|--------|-------|------|
| 1 | Mathematics, Statistics, Education | 0.86% | 4,785,103 |
| 2 | History, Mathematics, Literature, Religion | 1.21% | 6,684,586 |
| 3 | Medieval History, Music History, Art and Culture | 1.45% | 8,003,099 |
| 4 | Education, Wellbeing, Digital Learning, STEM | 3.86% | 21,348,980 |
| 5 | Career, Education, Finance, Technology | 1.89% | 10,450,928 |
| 6 | Aluminum, Physics, Biology, AI & Robotics | 17.78% | 98,368,523 |
| 7 | Conservation, Wildlife, Plants, Pets | 16.73% | 92,561,323 |
| 8 | Gaming, Gambling | 1.17% | 6,455,507 |
| 9 | Astronomy, Space, Astrophysics | 0.81% | 4,493,536 |
| 10 | Leadership, Health, Education, Safety | 7.34% | 40,603,579 |
| 11 | Programming, WebDesign | 1.56% | 8,630,635 |
| 12 | Photography, Technical, Food, Crafts | 25.68% | 142,111,098 |
| 13 | Sports | 0.90% | 5,004,064 |
| 14 | Music, Composition, Performance | 0.28% | 1,530,996 |
| 15 | Fantasy, Animation, Fiction | 0.23% | 1,296,383 |
| 16 | Environment, Energy, Sustainability | 7.28% | 40,297,278 |
| 17 | Health, Nutrition, Disease, Medicine | 7.02% | 38,854,459 |
| 18 | Performance, Security, Networking, Privacy | 2.27% | 12,586,375 |
| 19 | Computers, Relationships, Social Issues, Culture | 1.16% | 6,437,288 |
| 20 | Women's History, Immigration, Politics, Public Health | 0.51% | 2,811,316 |
| **Total** | | **100%** | **553,315,056** |
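
The Ratio column can be recomputed from the Rows column. The sketch below copies the row counts from the table and derives each cluster's percentage; the names `ROWS`, `TOTAL`, and `ratios` are hypothetical and used only for this illustration.

```python
# Row counts per cluster, copied from the table above.
ROWS = {
    1: 4_785_103, 2: 6_684_586, 3: 8_003_099, 4: 21_348_980,
    5: 10_450_928, 6: 98_368_523, 7: 92_561_323, 8: 6_455_507,
    9: 4_493_536, 10: 40_603_579, 11: 8_630_635, 12: 142_111_098,
    13: 5_004_064, 14: 1_530_996, 15: 1_296_383, 16: 40_297_278,
    17: 38_854_459, 18: 12_586_375, 19: 6_437_288, 20: 2_811_316,
}
TOTAL = 553_315_056  # total from the table and from num_examples


def ratios(rows: dict[int, int]) -> dict[int, float]:
    """Percentage of all rows held by each cluster, rounded to 2 decimals."""
    total = sum(rows.values())
    return {cluster: round(100 * count / total, 2) for cluster, count in rows.items()}


if sum(ROWS.values()) != TOTAL:
    raise ValueError("cluster row counts do not add up to the stated total")

for cluster, pct in ratios(ROWS).items():
    print(f"cluster {cluster:>2}: {pct:5.2f}%")
```

The per-cluster counts add up exactly to the stated total of 553,315,056 rows. Clusters 12, 6, and 7 together hold about 60% of the data, which is worth keeping in mind when sampling by topic.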
Provider:
mansaripo



