Ba2han/merged_pretraining_2704

Hugging Face | Updated 2026-04-28 | Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/Ba2han/merged_pretraining_2704
Resource overview:
---
pretty_name: Merged Pretraining 2704
task_categories:
- text-generation
---

# Ba2han/merged_pretraining_2704

Merged tokenized pretraining dataset.

## Format

This dataset contains one column:

| Column | Type |
|---|---|
| `input_ids` | `list<uint32>` |

Rows are stored as Parquet shards under `data/`.

## Source datasets

| Source dataset | Split | Raw rows read | Rows emitted | Rows skipped |
|---|---:|---:|---:|---:|
| `Ba2han/tokenized-corpus-0903` | `train` | 20,019,665 | 20,019,665 | 0 |
| `Ba2han/warmup_turmix` | `train` | 2,622,676 | 2,622,676 | 0 |
| `Ba2han/mixed_corpus-2303` | `train` | 22,647,724 | 22,647,724 | 0 |
| `Ba2han/English_mix-tokenized` | `train` | 2,741,081 | 2,741,081 | 0 |
| `Ba2han/forumsoh_tokenized` | `train` | 7,639,568 | 7,639,568 | 0 |
| `Ba2han/3.5b-tokens` | `train` | 4,848,610 | 4,848,610 | 0 |
| `Ba2han/long-corpus` | `train` | 7,418,465 | 7,418,465 | 0 |

## Processing

- Streamed each source dataset with `load_dataset(..., streaming=True)`.
- Read split: `train`.
- Kept only the `input_ids` column.
- Validated that each row is a 1D integer token-id sequence.
- Normalized token IDs to unsigned 32-bit integers.
- Merged rows from all sources as they arrived from parallel workers.
- Target shard size: 250,000 rows.
- Final row count: 67,937,789.
- Final shard count: 272.

Note: because sources are streamed in parallel, row ordering across source datasets is intentionally nondeterministic.
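The validation, normalization, and sharding steps listed above could look roughly like the following. This is a minimal sketch, not the pipeline that built the dataset: the names `normalize_row` and `shard_rows` are hypothetical, it works on plain Python lists instead of streamed datasets and parallel workers, and it raises on an invalid row, whereas the card only reports a count of skipped rows.

```python
from typing import Iterable, Iterator, List

UINT32_MAX = 2**32 - 1  # largest value an unsigned 32-bit integer can hold


def normalize_row(row: object) -> List[int]:
    """Check that a row is a flat sequence of integer token IDs in uint32 range."""
    if not isinstance(row, (list, tuple)):
        raise ValueError("row must be a 1D sequence of token IDs")
    tokens: List[int] = []
    for tok in row:
        # bool is a subclass of int, so it is rejected explicitly;
        # a nested list fails here too, which enforces the 1D requirement.
        if isinstance(tok, bool) or not isinstance(tok, int):
            raise ValueError(f"non-integer token ID: {tok!r}")
        if not 0 <= tok <= UINT32_MAX:
            raise ValueError(f"token ID out of uint32 range: {tok}")
        tokens.append(tok)
    return tokens


def shard_rows(
    rows: Iterable[object], shard_size: int
) -> Iterator[List[List[int]]]:
    """Group validated rows into shards of at most shard_size rows each."""
    shard: List[List[int]] = []
    for row in rows:
        shard.append(normalize_row(row))
        if len(shard) == shard_size:
            yield shard
            shard = []
    if shard:  # the final shard may be smaller than shard_size
        yield shard
```

With a shard size of 250,000, a generator of this shape would emit full shards followed by one partial shard, which matches the layout reported for this dataset.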
## Shards

| Path | Rows |
|---|---:|
| `data/train-000000.parquet` | 250,000 |
| `data/train-000001.parquet` | 250,000 |
| `data/train-000002.parquet` | 250,000 |
| `data/train-000003.parquet` | 250,000 |
| `data/train-000004.parquet` | 250,000 |
| `data/train-000005.parquet` | 250,000 |
| `data/train-000006.parquet` | 250,000 |
| `data/train-000007.parquet` | 250,000 |
| `data/train-000008.parquet` | 250,000 |
| `data/train-000009.parquet` | 250,000 |
| `data/train-000010.parquet` | 250,000 |
| `data/train-000011.parquet` | 250,000 |
| `data/train-000012.parquet` | 250,000 |
| `data/train-000013.parquet` | 250,000 |
| `data/train-000014.parquet` | 250,000 |
| `data/train-000015.parquet` | 250,000 |
| `data/train-000016.parquet` | 250,000 |
| `data/train-000017.parquet` | 250,000 |
| `data/train-000018.parquet` | 250,000 |
| `data/train-000019.parquet` | 250,000 |
| `data/train-000020.parquet` | 250,000 |
| `data/train-000021.parquet` | 250,000 |
| `data/train-000022.parquet` | 250,000 |
| `data/train-000023.parquet` | 250,000 |
| `data/train-000024.parquet` | 250,000 |
| `data/train-000025.parquet` | 250,000 |
| `data/train-000026.parquet` | 250,000 |
| `data/train-000027.parquet` | 250,000 |
| `data/train-000028.parquet` | 250,000 |
| `data/train-000029.parquet` | 250,000 |
| `data/train-000030.parquet` | 250,000 |
| `data/train-000031.parquet` | 250,000 |
| `data/train-000032.parquet` | 250,000 |
| `data/train-000033.parquet` | 250,000 |
| `data/train-000034.parquet` | 250,000 |
| `data/train-000035.parquet` | 250,000 |
| `data/train-000036.parquet` | 250,000 |
| `data/train-000037.parquet` | 250,000 |
| `data/train-000038.parquet` | 250,000 |
| `data/train-000039.parquet` | 250,000 |
| `data/train-000040.parquet` | 250,000 |
| `data/train-000041.parquet` | 250,000 |
| `data/train-000042.parquet` | 250,000 |
| `data/train-000043.parquet` | 250,000 |
| `data/train-000044.parquet` | 250,000 |
| `data/train-000045.parquet` | 250,000 |
| `data/train-000046.parquet` | 250,000 |
| `data/train-000047.parquet` | 250,000 |
| `data/train-000048.parquet` | 250,000 |
| `data/train-000049.parquet` | 250,000 |
| `data/train-000050.parquet` | 250,000 |
| `data/train-000051.parquet` | 250,000 |
| `data/train-000052.parquet` | 250,000 |
| `data/train-000053.parquet` | 250,000 |
| `data/train-000054.parquet` | 250,000 |
| `data/train-000055.parquet` | 250,000 |
| `data/train-000056.parquet` | 250,000 |
| `data/train-000057.parquet` | 250,000 |
| `data/train-000058.parquet` | 250,000 |
| `data/train-000059.parquet` | 250,000 |
| `data/train-000060.parquet` | 250,000 |
| `data/train-000061.parquet` | 250,000 |
| `data/train-000062.parquet` | 250,000 |
| `data/train-000063.parquet` | 250,000 |
| `data/train-000064.parquet` | 250,000 |
| `data/train-000065.parquet` | 250,000 |
| `data/train-000066.parquet` | 250,000 |
| `data/train-000067.parquet` | 250,000 |
| `data/train-000068.parquet` | 250,000 |
| `data/train-000069.parquet` | 250,000 |
| `data/train-000070.parquet` | 250,000 |
| `data/train-000071.parquet` | 250,000 |
| `data/train-000072.parquet` | 250,000 |
| `data/train-000073.parquet` | 250,000 |
| `data/train-000074.parquet` | 250,000 |
| `data/train-000075.parquet` | 250,000 |
| `data/train-000076.parquet` | 250,000 |
| `data/train-000077.parquet` | 250,000 |
| `data/train-000078.parquet` | 250,000 |
| `data/train-000079.parquet` | 250,000 |
| `data/train-000080.parquet` | 250,000 |
| `data/train-000081.parquet` | 250,000 |
| `data/train-000082.parquet` | 250,000 |
| `data/train-000083.parquet` | 250,000 |
| `data/train-000084.parquet` | 250,000 |
| `data/train-000085.parquet` | 250,000 |
| `data/train-000086.parquet` | 250,000 |
| `data/train-000087.parquet` | 250,000 |
| `data/train-000088.parquet` | 250,000 |
| `data/train-000089.parquet` | 250,000 |
| `data/train-000090.parquet` | 250,000 |
| `data/train-000091.parquet` | 250,000 |
| `data/train-000092.parquet` | 250,000 |
| `data/train-000093.parquet` | 250,000 |
| `data/train-000094.parquet` | 250,000 |
| `data/train-000095.parquet` | 250,000 |
| `data/train-000096.parquet` | 250,000 |
| `data/train-000097.parquet` | 250,000 |
| `data/train-000098.parquet` | 250,000 |
| `data/train-000099.parquet` | 250,000 |
| `data/train-000100.parquet` | 250,000 |
| `data/train-000101.parquet` | 250,000 |
| `data/train-000102.parquet` | 250,000 |
| `data/train-000103.parquet` | 250,000 |
| `data/train-000104.parquet` | 250,000 |
| `data/train-000105.parquet` | 250,000 |
| `data/train-000106.parquet` | 250,000 |
| `data/train-000107.parquet` | 250,000 |
| `data/train-000108.parquet` | 250,000 |
| `data/train-000109.parquet` | 250,000 |
| `data/train-000110.parquet` | 250,000 |
| `data/train-000111.parquet` | 250,000 |
| `data/train-000112.parquet` | 250,000 |
| `data/train-000113.parquet` | 250,000 |
| `data/train-000114.parquet` | 250,000 |
| `data/train-000115.parquet` | 250,000 |
| `data/train-000116.parquet` | 250,000 |
| `data/train-000117.parquet` | 250,000 |
| `data/train-000118.parquet` | 250,000 |
| `data/train-000119.parquet` | 250,000 |
| `data/train-000120.parquet` | 250,000 |
| `data/train-000121.parquet` | 250,000 |
| `data/train-000122.parquet` | 250,000 |
| `data/train-000123.parquet` | 250,000 |
| `data/train-000124.parquet` | 250,000 |
| `data/train-000125.parquet` | 250,000 |
| `data/train-000126.parquet` | 250,000 |
| `data/train-000127.parquet` | 250,000 |
| `data/train-000128.parquet` | 250,000 |
| `data/train-000129.parquet` | 250,000 |
| `data/train-000130.parquet` | 250,000 |
| `data/train-000131.parquet` | 250,000 |
| `data/train-000132.parquet` | 250,000 |
| `data/train-000133.parquet` | 250,000 |
| `data/train-000134.parquet` | 250,000 |
| `data/train-000135.parquet` | 250,000 |
| `data/train-000136.parquet` | 250,000 |
| `data/train-000137.parquet` | 250,000 |
| `data/train-000138.parquet` | 250,000 |
| `data/train-000139.parquet` | 250,000 |
| `data/train-000140.parquet` | 250,000 |
| `data/train-000141.parquet` | 250,000 |
| `data/train-000142.parquet` | 250,000 |
| `data/train-000143.parquet` | 250,000 |
| `data/train-000144.parquet` | 250,000 |
| `data/train-000145.parquet` | 250,000 |
| `data/train-000146.parquet` | 250,000 |
| `data/train-000147.parquet` | 250,000 |
| `data/train-000148.parquet` | 250,000 |
| `data/train-000149.parquet` | 250,000 |
| `data/train-000150.parquet` | 250,000 |
| `data/train-000151.parquet` | 250,000 |
| `data/train-000152.parquet` | 250,000 |
| `data/train-000153.parquet` | 250,000 |
| `data/train-000154.parquet` | 250,000 |
| `data/train-000155.parquet` | 250,000 |
| `data/train-000156.parquet` | 250,000 |
| `data/train-000157.parquet` | 250,000 |
| `data/train-000158.parquet` | 250,000 |
| `data/train-000159.parquet` | 250,000 |
| `data/train-000160.parquet` | 250,000 |
| `data/train-000161.parquet` | 250,000 |
| `data/train-000162.parquet` | 250,000 |
| `data/train-000163.parquet` | 250,000 |
| `data/train-000164.parquet` | 250,000 |
| `data/train-000165.parquet` | 250,000 |
| `data/train-000166.parquet` | 250,000 |
| `data/train-000167.parquet` | 250,000 |
| `data/train-000168.parquet` | 250,000 |
| `data/train-000169.parquet` | 250,000 |
| `data/train-000170.parquet` | 250,000 |
| `data/train-000171.parquet` | 250,000 |
| `data/train-000172.parquet` | 250,000 |
| `data/train-000173.parquet` | 250,000 |
| `data/train-000174.parquet` | 250,000 |
| `data/train-000175.parquet` | 250,000 |
| `data/train-000176.parquet` | 250,000 |
| `data/train-000177.parquet` | 250,000 |
| `data/train-000178.parquet` | 250,000 |
| `data/train-000179.parquet` | 250,000 |
| `data/train-000180.parquet` | 250,000 |
| `data/train-000181.parquet` | 250,000 |
| `data/train-000182.parquet` | 250,000 |
| `data/train-000183.parquet` | 250,000 |
| `data/train-000184.parquet` | 250,000 |
| `data/train-000185.parquet` | 250,000 |
| `data/train-000186.parquet` | 250,000 |
| `data/train-000187.parquet` | 250,000 |
| `data/train-000188.parquet` | 250,000 |
| `data/train-000189.parquet` | 250,000 |
| `data/train-000190.parquet` | 250,000 |
| `data/train-000191.parquet` | 250,000 |
| `data/train-000192.parquet` | 250,000 |
| `data/train-000193.parquet` | 250,000 |
| `data/train-000194.parquet` | 250,000 |
| `data/train-000195.parquet` | 250,000 |
| `data/train-000196.parquet` | 250,000 |
| `data/train-000197.parquet` | 250,000 |
| `data/train-000198.parquet` | 250,000 |
| `data/train-000199.parquet` | 250,000 |
| `data/train-000200.parquet` | 250,000 |
| `data/train-000201.parquet` | 250,000 |
| `data/train-000202.parquet` | 250,000 |
| `data/train-000203.parquet` | 250,000 |
| `data/train-000204.parquet` | 250,000 |
| `data/train-000205.parquet` | 250,000 |
| `data/train-000206.parquet` | 250,000 |
| `data/train-000207.parquet` | 250,000 |
| `data/train-000208.parquet` | 250,000 |
| `data/train-000209.parquet` | 250,000 |
| `data/train-000210.parquet` | 250,000 |
| `data/train-000211.parquet` | 250,000 |
| `data/train-000212.parquet` | 250,000 |
| `data/train-000213.parquet` | 250,000 |
| `data/train-000214.parquet` | 250,000 |
| `data/train-000215.parquet` | 250,000 |
| `data/train-000216.parquet` | 250,000 |
| `data/train-000217.parquet` | 250,000 |
| `data/train-000218.parquet` | 250,000 |
| `data/train-000219.parquet` | 250,000 |
| `data/train-000220.parquet` | 250,000 |
| `data/train-000221.parquet` | 250,000 |
| `data/train-000222.parquet` | 250,000 |
| `data/train-000223.parquet` | 250,000 |
| `data/train-000224.parquet` | 250,000 |
| `data/train-000225.parquet` | 250,000 |
| `data/train-000226.parquet` | 250,000 |
| `data/train-000227.parquet` | 250,000 |
| `data/train-000228.parquet` | 250,000 |
| `data/train-000229.parquet` | 250,000 |
| `data/train-000230.parquet` | 250,000 |
| `data/train-000231.parquet` | 250,000 |
| `data/train-000232.parquet` | 250,000 |
| `data/train-000233.parquet` | 250,000 |
| `data/train-000234.parquet` | 250,000 |
| `data/train-000235.parquet` | 250,000 |
| `data/train-000236.parquet` | 250,000 |
| `data/train-000237.parquet` | 250,000 |
| `data/train-000238.parquet` | 250,000 |
| `data/train-000239.parquet` | 250,000 |
| `data/train-000240.parquet` | 250,000 |
| `data/train-000241.parquet` | 250,000 |
| `data/train-000242.parquet` | 250,000 |
| `data/train-000243.parquet` | 250,000 |
| `data/train-000244.parquet` | 250,000 |
| `data/train-000245.parquet` | 250,000 |
| `data/train-000246.parquet` | 250,000 |
| `data/train-000247.parquet` | 250,000 |
| `data/train-000248.parquet` | 250,000 |
| `data/train-000249.parquet` | 250,000 |
| `data/train-000250.parquet` | 250,000 |
| `data/train-000251.parquet` | 250,000 |
| `data/train-000252.parquet` | 250,000 |
| `data/train-000253.parquet` | 250,000 |
| `data/train-000254.parquet` | 250,000 |
| `data/train-000255.parquet` | 250,000 |
| `data/train-000256.parquet` | 250,000 |
| `data/train-000257.parquet` | 250,000 |
| `data/train-000258.parquet` | 250,000 |
| `data/train-000259.parquet` | 250,000 |
| `data/train-000260.parquet` | 250,000 |
| `data/train-000261.parquet` | 250,000 |
| `data/train-000262.parquet` | 250,000 |
| `data/train-000263.parquet` | 250,000 |
| `data/train-000264.parquet` | 250,000 |
| `data/train-000265.parquet` | 250,000 |
| `data/train-000266.parquet` | 250,000 |
| `data/train-000267.parquet` | 250,000 |
| `data/train-000268.parquet` | 250,000 |
| `data/train-000269.parquet` | 250,000 |
| `data/train-000270.parquet` | 250,000 |
| `data/train-000271.parquet` | 187,789 |
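The shard table is consistent with the reported totals: 67,937,789 rows at 250,000 rows per shard gives 271 full shards plus one partial shard of 187,789 rows. The sketch below reproduces that layout as a sanity check. The function name `shard_layout` is hypothetical, and the `data/train-NNNNNN.parquet` naming pattern is inferred from the table rather than from the code that wrote the files.

```python
from typing import List, Tuple


def shard_layout(total_rows: int, shard_size: int) -> List[Tuple[str, int]]:
    """Return (path, row_count) pairs for a dataset split into fixed-size shards."""
    full_shards, remainder = divmod(total_rows, shard_size)
    sizes = [shard_size] * full_shards
    if remainder:
        sizes.append(remainder)  # trailing partial shard
    # Zero-padded six-digit index, matching the paths listed in the table.
    return [(f"data/train-{i:06d}.parquet", n) for i, n in enumerate(sizes)]


layout = shard_layout(total_rows=67_937_789, shard_size=250_000)
print(len(layout))   # 272
print(layout[-1])    # ('data/train-000271.parquet', 187789)
```

A check like this can be run against the row counts of downloaded shards to confirm that no file is missing or truncated.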
Provider: Ba2han