
Ba2han/merged_pt_2404

Source: Hugging Face. Updated 2026-04-25. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/Ba2han/merged_pt_2404
Description:
---
license: other
task_categories:
- text-generation
language:
- tr
- en
tags:
- pretraining
- merged
- input_ids
size_categories:
- n>10B
---

# Ba2han/merged_pt_2404

Merged pre-training corpus assembled on **2026-04-25** from 7 source datasets.
Only the `input_ids` column is stored, cast to **`uint16`** for storage efficiency.

> ⚠️ `uint16` has a maximum value of 65535.
> Token IDs above this threshold will overflow.
> Switch `CAST_DTYPE` in the build script to a wider type such as `uint32` if your vocabulary has more than 65,536 entries.

## Token counts by source

| Source dataset | Tokens | Share | Cumulative |
|:---|---:|---:|---:|
| [tokenized-corpus-0903](https://huggingface.co/datasets/Ba2han/tokenized-corpus-0903) | 5,216,849,173 | 14.73% | 5,216,849,173 |
| [warmup_turmix](https://huggingface.co/datasets/Ba2han/warmup_turmix) | 1,771,460,711 | 5.00% | 6,988,309,884 |
| [mixed_corpus-2303](https://huggingface.co/datasets/Ba2han/mixed_corpus-2303) | 14,211,625,068 | 40.13% | 21,199,934,952 |
| [English_mix-tokenized](https://huggingface.co/datasets/Ba2han/English_mix-tokenized) | 1,027,372,236 | 2.90% | 22,227,307,188 |
| [forumsoh_tokenized](https://huggingface.co/datasets/Ba2han/forumsoh_tokenized) | 4,232,493,403 | 11.95% | 26,459,800,591 |
| [3.5b-tokens](https://huggingface.co/datasets/Ba2han/3.5b-tokens) | 2,879,132,572 | 8.13% | 29,338,933,163 |
| [long-corpus](https://huggingface.co/datasets/Ba2han/long-corpus) | 6,072,626,857 | 17.15% | 35,411,560,020 |
| **TOTAL** | **35,411,560,020** | **100.00%** | **35,411,560,020** |

## Summary

| Metric | Value |
|:---|---:|
| Total tokens | **35.41 B** (35,411,560,020) |
| Source datasets | 7 |
| Parquet shards | 473 (~74.87 M tokens/shard) |
| Shard format | Zstandard-compressed Parquet |
| Token dtype | `uint16` |
| Column | `input_ids` (list of `uint16`) |
| Build date | 2026-04-25 |

## Source datasets

- [Ba2han/tokenized-corpus-0903](https://huggingface.co/datasets/Ba2han/tokenized-corpus-0903)
- [Ba2han/warmup_turmix](https://huggingface.co/datasets/Ba2han/warmup_turmix)
- [Ba2han/mixed_corpus-2303](https://huggingface.co/datasets/Ba2han/mixed_corpus-2303)
- [Ba2han/English_mix-tokenized](https://huggingface.co/datasets/Ba2han/English_mix-tokenized)
- [Ba2han/forumsoh_tokenized](https://huggingface.co/datasets/Ba2han/forumsoh_tokenized)
- [Ba2han/3.5b-tokens](https://huggingface.co/datasets/Ba2han/3.5b-tokens)
- [Ba2han/long-corpus](https://huggingface.co/datasets/Ba2han/long-corpus)

## Reproduction

```bash
pip install datasets huggingface_hub tqdm pyarrow
huggingface-cli login
python merge_and_push.py
```

## License

Inherited from source datasets. Check each source repository for its individual license.
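
A note on the `uint16` warning near the top of this card: the sketch below shows one way to check, before casting, whether a tokenizer's IDs fit in 16 bits. It is an illustration only and is not taken from `merge_and_push.py`. The helper names `pick_token_dtype` and `fits_uint16` are hypothetical, and how the build script actually consumes `CAST_DTYPE` is not documented here.

```python
from collections.abc import Iterable

UINT16_MAX = 2**16 - 1  # 65535, the largest value a uint16 can hold


def pick_token_dtype(vocab_size: int) -> str:
    """Return the narrowest unsigned dtype name that holds every token ID.

    Hypothetical helper. Token IDs are assumed to run from 0 to
    vocab_size - 1, so the largest ID is vocab_size - 1.
    """
    if vocab_size < 1:
        raise ValueError("vocab_size must be positive")
    return "uint16" if vocab_size - 1 <= UINT16_MAX else "uint32"


def fits_uint16(input_ids: Iterable[int]) -> bool:
    """Return True if every token ID lies in the range [0, 65535]."""
    return all(0 <= token_id <= UINT16_MAX for token_id in input_ids)


if __name__ == "__main__":
    # A 65,536-entry vocabulary still fits; one more entry does not.
    print(pick_token_dtype(65_536))
    print(pick_token_dtype(65_537))
    print(fits_uint16([0, 1024, 65_535]))
```

Running a check like `fits_uint16` on a sample of sequences before the cast would surface out-of-range IDs early, rather than after 473 shards have been written.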
Provider:
Ba2han