Ba2han/merged_pt_2404
Source: Hugging Face. Updated 2026-04-25; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/Ba2han/merged_pt_2404
Resource description:
---
license: other
task_categories:
- text-generation
language:
- tr
- en
tags:
- pretraining
- merged
- input_ids
size_categories:
- n>10B
---
# Ba2han/merged_pt_2404
Merged pre-training corpus assembled on **2026-04-25** from 7 source datasets.
Only the `input_ids` column is stored, cast to **`uint16`** for storage efficiency.
> ⚠️ `uint16` has a maximum value of 65535.
> Token IDs above this threshold cannot be represented and will overflow when cast.
> If your vocabulary is larger, switch `CAST_DTYPE` in the build script to a wider type such as `uint32`.
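The range check implied by the warning above can be sketched in plain Python. This is a minimal illustration, not part of the build script: the helper names `smallest_unsigned_dtype` and `check_fits_uint16` are hypothetical, and token IDs are assumed to run from 0 to `vocab_size - 1`.

```python
UINT16_MAX = 2**16 - 1  # 65535, the largest value uint16 can hold
UINT32_MAX = 2**32 - 1


def smallest_unsigned_dtype(vocab_size: int) -> str:
    """Return the narrowest unsigned dtype name that holds every token ID.

    Assumption: token IDs run from 0 to vocab_size - 1.
    """
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")
    max_id = vocab_size - 1
    if max_id <= UINT16_MAX:
        return "uint16"
    if max_id <= UINT32_MAX:
        return "uint32"
    return "uint64"


def check_fits_uint16(input_ids) -> None:
    """Raise ValueError if any token ID falls outside the uint16 range."""
    for position, token_id in enumerate(input_ids):
        if not 0 <= token_id <= UINT16_MAX:
            raise ValueError(
                f"token ID {token_id} at position {position} does not fit in uint16"
            )
```

Running a check like this before casting turns a silent overflow into an explicit error.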
## Token counts by source
| Source dataset | Tokens | Share | Cumulative |
|:---|---:|---:|---:|
| [tokenized-corpus-0903](https://huggingface.co/datasets/Ba2han/tokenized-corpus-0903) | 5,216,849,173 | 14.73% | 5,216,849,173 |
| [warmup_turmix](https://huggingface.co/datasets/Ba2han/warmup_turmix) | 1,771,460,711 | 5.00% | 6,988,309,884 |
| [mixed_corpus-2303](https://huggingface.co/datasets/Ba2han/mixed_corpus-2303) | 14,211,625,068 | 40.13% | 21,199,934,952 |
| [English_mix-tokenized](https://huggingface.co/datasets/Ba2han/English_mix-tokenized) | 1,027,372,236 | 2.90% | 22,227,307,188 |
| [forumsoh_tokenized](https://huggingface.co/datasets/Ba2han/forumsoh_tokenized) | 4,232,493,403 | 11.95% | 26,459,800,591 |
| [3.5b-tokens](https://huggingface.co/datasets/Ba2han/3.5b-tokens) | 2,879,132,572 | 8.13% | 29,338,933,163 |
| [long-corpus](https://huggingface.co/datasets/Ba2han/long-corpus) | 6,072,626,857 | 17.15% | 35,411,560,020 |
| **TOTAL** | **35,411,560,020** | **100.00%** | **35,411,560,020** |
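The Share and Cumulative columns follow directly from the per-source token counts. The sketch below recomputes them from the counts in the table; the variable names are illustrative only.

```python
# Token counts copied from the table above, in the same order.
token_counts = {
    "tokenized-corpus-0903": 5_216_849_173,
    "warmup_turmix": 1_771_460_711,
    "mixed_corpus-2303": 14_211_625_068,
    "English_mix-tokenized": 1_027_372_236,
    "forumsoh_tokenized": 4_232_493_403,
    "3.5b-tokens": 2_879_132_572,
    "long-corpus": 6_072_626_857,
}

total = sum(token_counts.values())

rows = []
running = 0
for name, count in token_counts.items():
    running += count                       # cumulative token count so far
    share = f"{100 * count / total:.2f}%"  # share of the whole corpus
    rows.append((name, count, share, running))

for name, count, share, cumulative in rows:
    print(f"{name:<24} {count:>16,} {share:>8} {cumulative:>16,}")
```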
## Summary
| Metric | Value |
|:---|---:|
| Total tokens | **35.41 B** (35,411,560,020) |
| Source datasets | 7 |
| Parquet shards | 473 (~74.87 M tokens/shard) |
| Shard format | Zstandard-compressed Parquet |
| Token dtype | `uint16` |
| Column | `input_ids` (list of `uint16`) |
| Build date | 2026-04-25 |
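Since the corpus is stored as many Parquet shards with a single `input_ids` column, it can be read lazily instead of downloaded in full. The following is a sketch under stated assumptions: it uses `load_dataset` from the `datasets` library with `streaming=True`, and it assumes the repository exposes a split named `train`, which this card does not confirm. The helpers `stream_rows` and `count_tokens` are hypothetical names.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

REPO_ID = "Ba2han/merged_pt_2404"


def stream_rows(repo_id: str = REPO_ID, split: str = "train") -> Iterator[Dict[str, List[int]]]:
    """Lazily yield rows without downloading every shard first.

    Assumption: the split is named "train"; adjust if the repository differs.
    """
    # Imported lazily so the pure helper below works without `datasets` installed.
    from datasets import load_dataset

    yield from load_dataset(repo_id, split=split, streaming=True)


def count_tokens(rows: Iterable[Dict[str, List[int]]], limit: int) -> int:
    """Sum the lengths of `input_ids` over at most `limit` rows."""
    return sum(len(row["input_ids"]) for row in islice(rows, limit))
```

With network access and `datasets` installed, `count_tokens(stream_rows(), limit=1000)` would count the tokens in the first 1000 rows.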
## Source datasets
- [Ba2han/tokenized-corpus-0903](https://huggingface.co/datasets/Ba2han/tokenized-corpus-0903)
- [Ba2han/warmup_turmix](https://huggingface.co/datasets/Ba2han/warmup_turmix)
- [Ba2han/mixed_corpus-2303](https://huggingface.co/datasets/Ba2han/mixed_corpus-2303)
- [Ba2han/English_mix-tokenized](https://huggingface.co/datasets/Ba2han/English_mix-tokenized)
- [Ba2han/forumsoh_tokenized](https://huggingface.co/datasets/Ba2han/forumsoh_tokenized)
- [Ba2han/3.5b-tokens](https://huggingface.co/datasets/Ba2han/3.5b-tokens)
- [Ba2han/long-corpus](https://huggingface.co/datasets/Ba2han/long-corpus)
## Reproduction
```bash
# Install the dependencies
pip install datasets huggingface_hub tqdm pyarrow
# Authenticate with the Hugging Face Hub
huggingface-cli login
# Run the build script
python merge_and_push.py
```
## License
Inherited from source datasets. Check each source repository for its individual license.
Provider:
Ba2han



