Ba2han/English_mix-tokenized
Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Ba2han/English_mix-tokenized
Resource description:
---
dataset_info:
  features:
  - name: input_ids
    sequence: uint32
  config_name: default
---
# English Mix Tokenized (uint32)
A processed and tokenized mix of English-language datasets, stored as `input_ids` only.
## Settings
- **Min Chars**: 500
- **Min Tokens**: 50
- **Max Tokens**: 2555 (Truncated with BOS/EOS)
- **Format**: uint32 (input_ids only)
- **Deduplication**: Exact MD5 check
- **ClimbMix1M subsets loaded**: 20 (`cluster_id=1` through `cluster_id=20`)
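
The settings above can be read as a filter, deduplicate, tokenize, and truncate pipeline. The following is a minimal sketch of that reading, not the script used to build this dataset. The tokenizer is not named in the card, so `toy_tokenize` is a hypothetical stand-in, and `BOS_ID` and `EOS_ID` are placeholder values. The order of the steps, hashing the raw text for the MD5 check, and counting BOS/EOS inside the 2555-token cap are all assumptions.

```python
import hashlib
import zlib
from array import array

MIN_CHARS = 500
MIN_TOKENS = 50
MAX_TOKENS = 2555
BOS_ID, EOS_ID = 1, 2  # hypothetical special-token ids, not from the card


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one id per whitespace-separated word."""
    return [3 + zlib.crc32(w.encode("utf-8")) % 100_000 for w in text.split()]


def process(texts):
    """Filter, deduplicate, tokenize, and truncate a list of documents."""
    seen = set()
    rows = []
    for text in texts:
        # Min Chars: drop short documents before tokenizing.
        if len(text) < MIN_CHARS:
            continue
        # Deduplication: exact match on the MD5 digest of the raw text (assumed).
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        ids = toy_tokenize(text)
        # Min Tokens: drop documents that tokenize to too few ids.
        if len(ids) < MIN_TOKENS:
            continue
        # Max Tokens: assume the cap includes the BOS and EOS tokens.
        body = ids[: MAX_TOKENS - 2]
        # Typecode "I" is an unsigned int, 32 bits wide on common platforms.
        rows.append(array("I", [BOS_ID, *body, EOS_ID]))
    return rows
```

Under these assumptions every stored row has between 52 and 2555 ids and always starts with BOS and ends with EOS.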
## Token Counts per Source
| Dataset Source (Config) | Tokens Contributed |
|-------------------------|--------------------|
| aimlresearch2023/ClimbMix1M (cluster_id=1) | 4,651,318 |
| aimlresearch2023/ClimbMix1M (cluster_id=2) | 7,061,287 |
| aimlresearch2023/ClimbMix1M (cluster_id=3) | 8,706,109 |
| aimlresearch2023/ClimbMix1M (cluster_id=4) | 20,735,992 |
| aimlresearch2023/ClimbMix1M (cluster_id=5) | 10,585,765 |
| aimlresearch2023/ClimbMix1M (cluster_id=6) | 104,203,918 |
| aimlresearch2023/ClimbMix1M (cluster_id=7) | 102,564,991 |
| aimlresearch2023/ClimbMix1M (cluster_id=8) | 6,102,443 |
| aimlresearch2023/ClimbMix1M (cluster_id=9) | 4,626,666 |
| aimlresearch2023/ClimbMix1M (cluster_id=10) | 43,601,295 |
| aimlresearch2023/ClimbMix1M (cluster_id=11) | 7,746,721 |
| aimlresearch2023/ClimbMix1M (cluster_id=12) | 126,992,730 |
| aimlresearch2023/ClimbMix1M (cluster_id=13) | 4,134,430 |
| aimlresearch2023/ClimbMix1M (cluster_id=14) | 1,389,200 |
| aimlresearch2023/ClimbMix1M (cluster_id=15) | 1,279,855 |
| aimlresearch2023/ClimbMix1M (cluster_id=16) | 43,829,104 |
| aimlresearch2023/ClimbMix1M (cluster_id=17) | 43,030,046 |
| aimlresearch2023/ClimbMix1M (cluster_id=18) | 11,558,760 |
| aimlresearch2023/ClimbMix1M (cluster_id=19) | 6,110,599 |
| aimlresearch2023/ClimbMix1M (cluster_id=20) | 2,907,596 |
| vm2825/nemotron-cc-v21-Parsed-QA4-filtered-1.7B-evensplit-RQ-8B-CoT-8B1-parts-0-23 (None) | 0 |
| pszemraj/simple_wikipedia_LM (default) | 33,519,169 |
| mbukowski/wikipedia-summary-dataset (None) | 432,034,242 |
| **Total** | **1,027,372,236** |
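
As a quick consistency check, the per-source counts in the table can be summed and compared with the reported total, and each source's share of the mix computed. The sketch below copies the numbers from the table; the variable names are illustrative only.

```python
# Token counts for ClimbMix1M cluster_id=1 through cluster_id=20, in table order.
climbmix_tokens = [
    4_651_318, 7_061_287, 8_706_109, 20_735_992, 10_585_765,
    104_203_918, 102_564_991, 6_102_443, 4_626_666, 43_601_295,
    7_746_721, 126_992_730, 4_134_430, 1_389_200, 1_279_855,
    43_829_104, 43_030_046, 11_558_760, 6_110_599, 2_907_596,
]
other_sources = {
    "vm2825/nemotron-cc-v21-Parsed-QA4-filtered-1.7B-evensplit-RQ-8B-CoT-8B1-parts-0-23": 0,
    "pszemraj/simple_wikipedia_LM": 33_519_169,
    "mbukowski/wikipedia-summary-dataset": 432_034_242,
}
reported_total = 1_027_372_236

climbmix_total = sum(climbmix_tokens)
total = climbmix_total + sum(other_sources.values())
# The summed rows should match the Total row of the table.
assert total == reported_total


def share(tokens):
    """Percentage of the whole mix contributed by `tokens`."""
    return 100 * tokens / total


print(f"ClimbMix1M (all clusters): {climbmix_total:,} tokens, {share(climbmix_total):.2f}%")
for name, tokens in other_sources.items():
    print(f"{name}: {tokens:,} tokens, {share(tokens):.2f}%")
```

The rows do sum to the reported total. The 20 ClimbMix1M clusters together contribute a little over half of the tokens, and the nemotron-cc source contributes none.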
Provider:
Ba2han



