Ba2han/mixed_corpus-2303
Source: Hugging Face · Updated 2026-03-23 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Ba2han/mixed_corpus-2303
Description:
---
dataset_info:
  features:
  - name: input_ids
    sequence: uint16
---
# Mixed Corpus 2303
This dataset is a shuffled combination of the following tokenized datasets.
Only the `input_ids` column has been retained, and the values have been cast to `uint16` to save space.
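
As an illustration of the `uint16` cast, here is a minimal sketch using only the Python standard library. It assumes every token ID fits in 16 bits (0 to 65,535), which the cast implies but the card does not state explicitly. The helper `pack_uint16` and the sample IDs are hypothetical and are not part of the dataset tooling.

```python
from array import array

UINT16_MAX = 2**16 - 1  # 65,535: the largest value an unsigned 16-bit integer can hold


def pack_uint16(token_ids):
    """Pack token IDs into an unsigned 16-bit array, rejecting IDs that do not fit."""
    if any(t < 0 or t > UINT16_MAX for t in token_ids):
        raise ValueError("token ID outside the uint16 range")
    return array("H", token_ids)  # "H" is the type code for unsigned short


ids = [101, 2023, 65535, 0]  # made-up token IDs, for illustration only
packed = pack_uint16(ids)

# Bytes needed for the packed sequence (2 bytes per token on common platforms).
bytes_used = packed.itemsize * len(packed)
print(bytes_used)
```

At two bytes per token, the sequence takes half the space of a 32-bit representation and a quarter of a 64-bit one.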
## Token Amounts
| Source Dataset | Token Count |
| :--- | ---: |
| [Ba2han/forumsoh_tokenized](https://huggingface.co/datasets/Ba2han/forumsoh_tokenized) | 4,232,493,403 |
| [Ba2han/3.5b-tokens](https://huggingface.co/datasets/Ba2han/3.5b-tokens) | 2,879,132,572 |
| [Ba2han/long-corpus](https://huggingface.co/datasets/Ba2han/long-corpus) | 6,072,626,857 |
| [Ba2han/English_mix-tokenized](https://huggingface.co/datasets/Ba2han/English_mix-tokenized) | 1,027,372,236 |
| **Total** | **14,211,625,068** |
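
The total can be checked by summing the per-source counts from the table. The sketch below also derives each source's share of the mix and a rough raw payload size, assuming exactly two bytes per token (from the `uint16` cast) and ignoring any file-format overhead or compression. The variable names are illustrative.

```python
# Token counts copied from the table above.
token_counts = {
    "Ba2han/forumsoh_tokenized": 4_232_493_403,
    "Ba2han/3.5b-tokens": 2_879_132_572,
    "Ba2han/long-corpus": 6_072_626_857,
    "Ba2han/English_mix-tokenized": 1_027_372_236,
}

total_tokens = sum(token_counts.values())

# Fraction of the combined corpus contributed by each source.
shares = {name: count / total_tokens for name, count in token_counts.items()}

# Raw payload estimate: 2 bytes per uint16 token, no container overhead.
raw_bytes = total_tokens * 2

print(f"total tokens: {total_tokens:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"raw uint16 payload: {raw_bytes / 1e9:.1f} GB")
```

The sum matches the stated total of 14,211,625,068 tokens.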
Provider:
Ba2han



