
Ba2han/English_mix-tokenized

Source: Hugging Face | Updated: 2026-03-23 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/Ba2han/English_mix-tokenized
Resource description:
---
dataset_info:
  features:
  - name: input_ids
    sequence: uint32
  config_name: default
---

# English Mix Tokenized (uint32)

Processed and tokenized dataset mix.

## Settings

- **Min Chars**: 500
- **Min Tokens**: 50
- **Max Tokens**: 2555 (Truncated with BOS/EOS)
- **Format**: uint32 (input_ids only)
- **Deduplication**: Exact MD5 check
- **ClimbMix1M subsets loaded**: 20 (`cluster_id=1, cluster_id=2, cluster_id=3, cluster_id=4, cluster_id=5, cluster_id=6, cluster_id=7, cluster_id=8, cluster_id=9, cluster_id=10, cluster_id=11, cluster_id=12, cluster_id=13, cluster_id=14, cluster_id=15, cluster_id=16, cluster_id=17, cluster_id=18, cluster_id=19, cluster_id=20`)

## Token Counts per Source

| Dataset Source (Config) | Tokens Contributed |
|-------------------------|--------------------|
| aimlresearch2023/ClimbMix1M (cluster_id=1) | 4,651,318 |
| aimlresearch2023/ClimbMix1M (cluster_id=2) | 7,061,287 |
| aimlresearch2023/ClimbMix1M (cluster_id=3) | 8,706,109 |
| aimlresearch2023/ClimbMix1M (cluster_id=4) | 20,735,992 |
| aimlresearch2023/ClimbMix1M (cluster_id=5) | 10,585,765 |
| aimlresearch2023/ClimbMix1M (cluster_id=6) | 104,203,918 |
| aimlresearch2023/ClimbMix1M (cluster_id=7) | 102,564,991 |
| aimlresearch2023/ClimbMix1M (cluster_id=8) | 6,102,443 |
| aimlresearch2023/ClimbMix1M (cluster_id=9) | 4,626,666 |
| aimlresearch2023/ClimbMix1M (cluster_id=10) | 43,601,295 |
| aimlresearch2023/ClimbMix1M (cluster_id=11) | 7,746,721 |
| aimlresearch2023/ClimbMix1M (cluster_id=12) | 126,992,730 |
| aimlresearch2023/ClimbMix1M (cluster_id=13) | 4,134,430 |
| aimlresearch2023/ClimbMix1M (cluster_id=14) | 1,389,200 |
| aimlresearch2023/ClimbMix1M (cluster_id=15) | 1,279,855 |
| aimlresearch2023/ClimbMix1M (cluster_id=16) | 43,829,104 |
| aimlresearch2023/ClimbMix1M (cluster_id=17) | 43,030,046 |
| aimlresearch2023/ClimbMix1M (cluster_id=18) | 11,558,760 |
| aimlresearch2023/ClimbMix1M (cluster_id=19) | 6,110,599 |
| aimlresearch2023/ClimbMix1M (cluster_id=20) | 2,907,596 |
| vm2825/nemotron-cc-v21-Parsed-QA4-filtered-1.7B-evensplit-RQ-8B-CoT-8B1-parts-0-23 (None) | 0 |
| pszemraj/simple_wikipedia_LM (default) | 33,519,169 |
| mbukowski/wikipedia-summary-dataset (None) | 432,034,242 |
| **Total** | **1,027,372,236** |
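
The card lists the settings but not the script that applied them. The following is a minimal sketch of how such a filter could look, not the author's actual pipeline. Several details are assumptions: that the MD5 check runs on the raw text, that the minimum-token check happens before BOS/EOS are added, and that the 2555-token cap includes the BOS and EOS tokens. The names `preprocess`, `toy_tokenize`, `bos_id`, and `eos_id` are hypothetical, and the toy tokenizer stands in for whatever tokenizer was really used.

```python
import hashlib
from typing import Callable, Iterable, Iterator, List

MIN_CHARS = 500       # "Min Chars" from the card
MIN_TOKENS = 50       # "Min Tokens" from the card
MAX_TOKENS = 2555     # "Max Tokens" from the card
UINT32_MAX = 2**32 - 1


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one id per whitespace-separated word.

    Ids start at 2 so that 0 and 1 stay free for BOS and EOS.
    """
    return [sum(map(ord, word)) + 2 for word in text.split()]


def preprocess(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]],
    bos_id: int,
    eos_id: int,
) -> Iterator[List[int]]:
    """Yield filtered, deduplicated, truncated input_ids sequences."""
    seen_digests = set()
    for text in texts:
        # Drop documents below the character threshold.
        if len(text) < MIN_CHARS:
            continue
        # Exact deduplication (assumption: MD5 of the UTF-8 encoded raw text).
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        ids = tokenize(text)
        # Drop documents below the token threshold (assumption: before BOS/EOS).
        if len(ids) < MIN_TOKENS:
            continue
        # Truncate so BOS + body + EOS fits the cap (assumption: cap is inclusive).
        sequence = [bos_id] + ids[: MAX_TOKENS - 2] + [eos_id]
        # The card stores input_ids as uint32, so every id must fit that range.
        if any(not 0 <= token <= UINT32_MAX for token in sequence):
            raise ValueError("token id outside the uint32 range")
        yield sequence
```

Calling `list(preprocess(docs, toy_tokenize, bos_id=0, eos_id=1))` on a list of strings returns one list of token ids per surviving document. Swapping in a real tokenizer only requires passing a different `tokenize` callable.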
Provided by:
Ba2han