
Ba2han/mixed_corpus-2303

Source: Hugging Face. Updated 2026-03-23. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Ba2han/mixed_corpus-2303
Resource description:
---
dataset_info:
  features:
  - name: input_ids
    sequence: uint16
---

# Mixed Corpus 2303

This dataset is a shuffled combination of the following tokenized datasets. Only the `input_ids` column has been retained, and the values have been cast to `uint16` to save space.

## Token Amounts

| Source Dataset | Token Count |
| :--- | :--- |
| [Ba2han/forumsoh_tokenized](https://huggingface.co/datasets/Ba2han/forumsoh_tokenized) | 4,232,493,403 |
| [Ba2han/3.5b-tokens](https://huggingface.co/datasets/Ba2han/3.5b-tokens) | 2,879,132,572 |
| [Ba2han/long-corpus](https://huggingface.co/datasets/Ba2han/long-corpus) | 6,072,626,857 |
| [Ba2han/English_mix-tokenized](https://huggingface.co/datasets/Ba2han/English_mix-tokenized) | 1,027,372,236 |
| **Total** | **14,211,625,068** |
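The token counts and the `uint16` cast described above can be checked with a short sketch. This is an illustration only, not the author's preprocessing code: the helper `pack_uint16` and the sample IDs are hypothetical, and the standard-library `array` module stands in for whatever tooling was actually used. It sums the per-source counts from the Token Amounts table and shows that a `uint16` value occupies 2 bytes, which only works when every token ID is at most 65,535.

```python
from array import array

# Token counts copied from the "Token Amounts" table.
TOKEN_COUNTS = {
    "Ba2han/forumsoh_tokenized": 4_232_493_403,
    "Ba2han/3.5b-tokens": 2_879_132_572,
    "Ba2han/long-corpus": 6_072_626_857,
    "Ba2han/English_mix-tokenized": 1_027_372_236,
}

UINT16_MAX = 2**16 - 1  # largest token ID a uint16 can hold


def pack_uint16(token_ids):
    """Hypothetical helper: pack token IDs into unsigned 16-bit integers.

    Raises ValueError if any ID falls outside the uint16 range.
    """
    token_ids = list(token_ids)
    for tid in token_ids:
        if not 0 <= tid <= UINT16_MAX:
            raise ValueError(f"token ID {tid} does not fit in uint16")
    return array("H", token_ids)  # "H" = unsigned short


total = sum(TOKEN_COUNTS.values())
packed = pack_uint16([0, 17, 65535])  # made-up sample IDs

# Raw size of the token IDs alone, ignoring file-format overhead and compression.
raw_bytes_uint16 = total * packed.itemsize

print(f"total tokens: {total:,}")
print(f"bytes per token: {packed.itemsize}")
print(f"raw uint16 payload: {raw_bytes_uint16:,} bytes")
```

The range check matters because a silent cast would wrap any ID above 65,535, so this storage scheme assumes a tokenizer vocabulary of at most 65,536 entries.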
Provided by:
Ba2han