Ba2han/merged_pretraining_2704
Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/Ba2han/merged_pretraining_2704
Resource description:
---
pretty_name: Merged Pretraining 2704
task_categories:
- text-generation
---
# Ba2han/merged_pretraining_2704
Merged tokenized pretraining dataset.
## Format
This dataset contains one column:

| Column | Type |
|---|---|
| `input_ids` | `list<uint32>` |

Rows are stored as Parquet shards under `data/`.
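
As a minimal sketch of reading this layout, the snippet below streams a few rows and summarizes their lengths. It assumes the `datasets` library is installed and that the shards match the glob `data/*.parquet`; the helper names `stream_rows` and `token_stats` are hypothetical and not part of the dataset.

```python
from itertools import islice
from typing import Dict, Iterable, List

REPO_ID = "Ba2han/merged_pretraining_2704"
SHARD_GLOB = "data/*.parquet"  # assumption: every shard under data/ matches this glob


def stream_rows(limit: int) -> List[List[int]]:
    """Stream the first `limit` rows without downloading every shard."""
    # Imported lazily so the pure helper below works without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(REPO_ID, data_files=SHARD_GLOB, split="train", streaming=True)
    return [list(row["input_ids"]) for row in islice(ds, limit)]


def token_stats(rows: Iterable[List[int]]) -> Dict[str, int]:
    """Count rows and tokens, and report the longest sequence seen."""
    lengths = [len(r) for r in rows]
    return {
        "rows": len(lengths),
        "tokens": sum(lengths),
        "max_len": max(lengths, default=0),
    }
```

Calling `token_stats(stream_rows(1000))` would give a quick look at sequence lengths; it needs network access to the Hub.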
## Source datasets
| Source dataset | Split | Raw rows read | Rows emitted | Rows skipped |
|---|---:|---:|---:|---:|
| `Ba2han/tokenized-corpus-0903` | `train` | 20,019,665 | 20,019,665 | 0 |
| `Ba2han/warmup_turmix` | `train` | 2,622,676 | 2,622,676 | 0 |
| `Ba2han/mixed_corpus-2303` | `train` | 22,647,724 | 22,647,724 | 0 |
| `Ba2han/English_mix-tokenized` | `train` | 2,741,081 | 2,741,081 | 0 |
| `Ba2han/forumsoh_tokenized` | `train` | 7,639,568 | 7,639,568 | 0 |
| `Ba2han/3.5b-tokens` | `train` | 4,848,610 | 4,848,610 | 0 |
| `Ba2han/long-corpus` | `train` | 7,418,465 | 7,418,465 | 0 |
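
The per-source counts can be cross-checked against the final row count this card reports (67,937,789). The sketch below copies the "Rows emitted" column from the table above and also prints each source's share of the merged dataset; the variable names are illustrative only.

```python
# Rows emitted per source, copied from the table above.
ROWS_EMITTED = {
    "Ba2han/tokenized-corpus-0903": 20_019_665,
    "Ba2han/warmup_turmix": 2_622_676,
    "Ba2han/mixed_corpus-2303": 22_647_724,
    "Ba2han/English_mix-tokenized": 2_741_081,
    "Ba2han/forumsoh_tokenized": 7_639_568,
    "Ba2han/3.5b-tokens": 4_848_610,
    "Ba2han/long-corpus": 7_418_465,
}
REPORTED_FINAL_ROWS = 67_937_789  # final row count stated in this card

total = sum(ROWS_EMITTED.values())
assert total == REPORTED_FINAL_ROWS, (total, REPORTED_FINAL_ROWS)

# Fraction of merged rows contributed by each source (rows, not tokens).
shares = {name: rows / total for name, rows in ROWS_EMITTED.items()}
for name, rows in sorted(ROWS_EMITTED.items(), key=lambda kv: -kv[1]):
    print(f"{name:32s} {rows:>12,d} {shares[name]:6.1%}")
```

These shares are by row; since rows differ in length, the token-level mix may look different.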
## Processing
- Streamed each source dataset with `load_dataset(..., streaming=True)`.
- Read split: `train`.
- Kept only the `input_ids` column.
- Validated that each row is a 1D integer token-id sequence.
- Normalized token IDs to unsigned 32-bit integers.
- Merged rows from all sources as they arrived from parallel workers.
- Target shard size: 250,000 rows.
- Final row count: 67,937,789.
- Final shard count: 272.

Note: because sources are streamed in parallel, row ordering across source datasets is intentionally nondeterministic.
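
The steps above can be sketched with the standard library alone. This is not the script that built the dataset: it is an illustration under stated assumptions, using one thread per source and a shared queue, with in-memory lists standing in for streamed datasets. All function names (`normalize_row`, `merge_sources`, `shard_rows`) are hypothetical.

```python
import queue
import threading
from typing import Iterable, Iterator, List, Sequence

UINT32_MAX = 2**32 - 1
_DONE = object()  # sentinel a worker enqueues when its source is exhausted


def normalize_row(row: object) -> List[int]:
    """Validate a flat integer token-id sequence that fits in uint32."""
    if not isinstance(row, (list, tuple)):
        raise TypeError("input_ids must be a flat sequence")
    out = []
    for tok in row:
        # bool is a subclass of int, so reject it explicitly; nested lists fail here too.
        if isinstance(tok, bool) or not isinstance(tok, int):
            raise TypeError(f"non-integer token id: {tok!r}")
        if not 0 <= tok <= UINT32_MAX:
            raise ValueError(f"token id out of uint32 range: {tok}")
        out.append(tok)
    return out


def merge_sources(sources: Sequence[Iterable[dict]]) -> Iterator[List[int]]:
    """Yield normalized rows in arrival order, so ordering across sources varies."""
    q: queue.Queue = queue.Queue(maxsize=1024)  # bounded, so fast sources wait

    def worker(src: Iterable[dict]) -> None:
        try:
            for record in src:
                q.put(normalize_row(record["input_ids"]))  # keep only input_ids
        except Exception as exc:  # forward errors to the consumer
            q.put(exc)
        finally:
            q.put(_DONE)

    threads = [threading.Thread(target=worker, args=(s,), daemon=True) for s in sources]
    for t in threads:
        t.start()
    remaining = len(threads)
    while remaining:
        item = q.get()
        if item is _DONE:
            remaining -= 1
        elif isinstance(item, Exception):
            raise item
        else:
            yield item


def shard_rows(rows: Iterable[List[int]], target: int) -> Iterator[List[List[int]]]:
    """Group rows into shards of `target` rows; the last shard may be smaller."""
    buf: List[List[int]] = []
    for row in rows:
        buf.append(row)
        if len(buf) == target:
            yield buf
            buf = []
    if buf:
        yield buf
```

In a real pipeline each source would be an iterable returned by `load_dataset(..., streaming=True)`, `target` would be 250,000, and every shard would be written out as a Parquet file rather than kept in memory.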
## Shards
| Path | Rows |
|---|---:|
| `data/train-000000.parquet` | 250,000 |
| `data/train-000001.parquet` | 250,000 |
| `data/train-000002.parquet` | 250,000 |
| `data/train-000003.parquet` | 250,000 |
| `data/train-000004.parquet` | 250,000 |
| `data/train-000005.parquet` | 250,000 |
| `data/train-000006.parquet` | 250,000 |
| `data/train-000007.parquet` | 250,000 |
| `data/train-000008.parquet` | 250,000 |
| `data/train-000009.parquet` | 250,000 |
| `data/train-000010.parquet` | 250,000 |
| `data/train-000011.parquet` | 250,000 |
| `data/train-000012.parquet` | 250,000 |
| `data/train-000013.parquet` | 250,000 |
| `data/train-000014.parquet` | 250,000 |
| `data/train-000015.parquet` | 250,000 |
| `data/train-000016.parquet` | 250,000 |
| `data/train-000017.parquet` | 250,000 |
| `data/train-000018.parquet` | 250,000 |
| `data/train-000019.parquet` | 250,000 |
| `data/train-000020.parquet` | 250,000 |
| `data/train-000021.parquet` | 250,000 |
| `data/train-000022.parquet` | 250,000 |
| `data/train-000023.parquet` | 250,000 |
| `data/train-000024.parquet` | 250,000 |
| `data/train-000025.parquet` | 250,000 |
| `data/train-000026.parquet` | 250,000 |
| `data/train-000027.parquet` | 250,000 |
| `data/train-000028.parquet` | 250,000 |
| `data/train-000029.parquet` | 250,000 |
| `data/train-000030.parquet` | 250,000 |
| `data/train-000031.parquet` | 250,000 |
| `data/train-000032.parquet` | 250,000 |
| `data/train-000033.parquet` | 250,000 |
| `data/train-000034.parquet` | 250,000 |
| `data/train-000035.parquet` | 250,000 |
| `data/train-000036.parquet` | 250,000 |
| `data/train-000037.parquet` | 250,000 |
| `data/train-000038.parquet` | 250,000 |
| `data/train-000039.parquet` | 250,000 |
| `data/train-000040.parquet` | 250,000 |
| `data/train-000041.parquet` | 250,000 |
| `data/train-000042.parquet` | 250,000 |
| `data/train-000043.parquet` | 250,000 |
| `data/train-000044.parquet` | 250,000 |
| `data/train-000045.parquet` | 250,000 |
| `data/train-000046.parquet` | 250,000 |
| `data/train-000047.parquet` | 250,000 |
| `data/train-000048.parquet` | 250,000 |
| `data/train-000049.parquet` | 250,000 |
| `data/train-000050.parquet` | 250,000 |
| `data/train-000051.parquet` | 250,000 |
| `data/train-000052.parquet` | 250,000 |
| `data/train-000053.parquet` | 250,000 |
| `data/train-000054.parquet` | 250,000 |
| `data/train-000055.parquet` | 250,000 |
| `data/train-000056.parquet` | 250,000 |
| `data/train-000057.parquet` | 250,000 |
| `data/train-000058.parquet` | 250,000 |
| `data/train-000059.parquet` | 250,000 |
| `data/train-000060.parquet` | 250,000 |
| `data/train-000061.parquet` | 250,000 |
| `data/train-000062.parquet` | 250,000 |
| `data/train-000063.parquet` | 250,000 |
| `data/train-000064.parquet` | 250,000 |
| `data/train-000065.parquet` | 250,000 |
| `data/train-000066.parquet` | 250,000 |
| `data/train-000067.parquet` | 250,000 |
| `data/train-000068.parquet` | 250,000 |
| `data/train-000069.parquet` | 250,000 |
| `data/train-000070.parquet` | 250,000 |
| `data/train-000071.parquet` | 250,000 |
| `data/train-000072.parquet` | 250,000 |
| `data/train-000073.parquet` | 250,000 |
| `data/train-000074.parquet` | 250,000 |
| `data/train-000075.parquet` | 250,000 |
| `data/train-000076.parquet` | 250,000 |
| `data/train-000077.parquet` | 250,000 |
| `data/train-000078.parquet` | 250,000 |
| `data/train-000079.parquet` | 250,000 |
| `data/train-000080.parquet` | 250,000 |
| `data/train-000081.parquet` | 250,000 |
| `data/train-000082.parquet` | 250,000 |
| `data/train-000083.parquet` | 250,000 |
| `data/train-000084.parquet` | 250,000 |
| `data/train-000085.parquet` | 250,000 |
| `data/train-000086.parquet` | 250,000 |
| `data/train-000087.parquet` | 250,000 |
| `data/train-000088.parquet` | 250,000 |
| `data/train-000089.parquet` | 250,000 |
| `data/train-000090.parquet` | 250,000 |
| `data/train-000091.parquet` | 250,000 |
| `data/train-000092.parquet` | 250,000 |
| `data/train-000093.parquet` | 250,000 |
| `data/train-000094.parquet` | 250,000 |
| `data/train-000095.parquet` | 250,000 |
| `data/train-000096.parquet` | 250,000 |
| `data/train-000097.parquet` | 250,000 |
| `data/train-000098.parquet` | 250,000 |
| `data/train-000099.parquet` | 250,000 |
| `data/train-000100.parquet` | 250,000 |
| `data/train-000101.parquet` | 250,000 |
| `data/train-000102.parquet` | 250,000 |
| `data/train-000103.parquet` | 250,000 |
| `data/train-000104.parquet` | 250,000 |
| `data/train-000105.parquet` | 250,000 |
| `data/train-000106.parquet` | 250,000 |
| `data/train-000107.parquet` | 250,000 |
| `data/train-000108.parquet` | 250,000 |
| `data/train-000109.parquet` | 250,000 |
| `data/train-000110.parquet` | 250,000 |
| `data/train-000111.parquet` | 250,000 |
| `data/train-000112.parquet` | 250,000 |
| `data/train-000113.parquet` | 250,000 |
| `data/train-000114.parquet` | 250,000 |
| `data/train-000115.parquet` | 250,000 |
| `data/train-000116.parquet` | 250,000 |
| `data/train-000117.parquet` | 250,000 |
| `data/train-000118.parquet` | 250,000 |
| `data/train-000119.parquet` | 250,000 |
| `data/train-000120.parquet` | 250,000 |
| `data/train-000121.parquet` | 250,000 |
| `data/train-000122.parquet` | 250,000 |
| `data/train-000123.parquet` | 250,000 |
| `data/train-000124.parquet` | 250,000 |
| `data/train-000125.parquet` | 250,000 |
| `data/train-000126.parquet` | 250,000 |
| `data/train-000127.parquet` | 250,000 |
| `data/train-000128.parquet` | 250,000 |
| `data/train-000129.parquet` | 250,000 |
| `data/train-000130.parquet` | 250,000 |
| `data/train-000131.parquet` | 250,000 |
| `data/train-000132.parquet` | 250,000 |
| `data/train-000133.parquet` | 250,000 |
| `data/train-000134.parquet` | 250,000 |
| `data/train-000135.parquet` | 250,000 |
| `data/train-000136.parquet` | 250,000 |
| `data/train-000137.parquet` | 250,000 |
| `data/train-000138.parquet` | 250,000 |
| `data/train-000139.parquet` | 250,000 |
| `data/train-000140.parquet` | 250,000 |
| `data/train-000141.parquet` | 250,000 |
| `data/train-000142.parquet` | 250,000 |
| `data/train-000143.parquet` | 250,000 |
| `data/train-000144.parquet` | 250,000 |
| `data/train-000145.parquet` | 250,000 |
| `data/train-000146.parquet` | 250,000 |
| `data/train-000147.parquet` | 250,000 |
| `data/train-000148.parquet` | 250,000 |
| `data/train-000149.parquet` | 250,000 |
| `data/train-000150.parquet` | 250,000 |
| `data/train-000151.parquet` | 250,000 |
| `data/train-000152.parquet` | 250,000 |
| `data/train-000153.parquet` | 250,000 |
| `data/train-000154.parquet` | 250,000 |
| `data/train-000155.parquet` | 250,000 |
| `data/train-000156.parquet` | 250,000 |
| `data/train-000157.parquet` | 250,000 |
| `data/train-000158.parquet` | 250,000 |
| `data/train-000159.parquet` | 250,000 |
| `data/train-000160.parquet` | 250,000 |
| `data/train-000161.parquet` | 250,000 |
| `data/train-000162.parquet` | 250,000 |
| `data/train-000163.parquet` | 250,000 |
| `data/train-000164.parquet` | 250,000 |
| `data/train-000165.parquet` | 250,000 |
| `data/train-000166.parquet` | 250,000 |
| `data/train-000167.parquet` | 250,000 |
| `data/train-000168.parquet` | 250,000 |
| `data/train-000169.parquet` | 250,000 |
| `data/train-000170.parquet` | 250,000 |
| `data/train-000171.parquet` | 250,000 |
| `data/train-000172.parquet` | 250,000 |
| `data/train-000173.parquet` | 250,000 |
| `data/train-000174.parquet` | 250,000 |
| `data/train-000175.parquet` | 250,000 |
| `data/train-000176.parquet` | 250,000 |
| `data/train-000177.parquet` | 250,000 |
| `data/train-000178.parquet` | 250,000 |
| `data/train-000179.parquet` | 250,000 |
| `data/train-000180.parquet` | 250,000 |
| `data/train-000181.parquet` | 250,000 |
| `data/train-000182.parquet` | 250,000 |
| `data/train-000183.parquet` | 250,000 |
| `data/train-000184.parquet` | 250,000 |
| `data/train-000185.parquet` | 250,000 |
| `data/train-000186.parquet` | 250,000 |
| `data/train-000187.parquet` | 250,000 |
| `data/train-000188.parquet` | 250,000 |
| `data/train-000189.parquet` | 250,000 |
| `data/train-000190.parquet` | 250,000 |
| `data/train-000191.parquet` | 250,000 |
| `data/train-000192.parquet` | 250,000 |
| `data/train-000193.parquet` | 250,000 |
| `data/train-000194.parquet` | 250,000 |
| `data/train-000195.parquet` | 250,000 |
| `data/train-000196.parquet` | 250,000 |
| `data/train-000197.parquet` | 250,000 |
| `data/train-000198.parquet` | 250,000 |
| `data/train-000199.parquet` | 250,000 |
| `data/train-000200.parquet` | 250,000 |
| `data/train-000201.parquet` | 250,000 |
| `data/train-000202.parquet` | 250,000 |
| `data/train-000203.parquet` | 250,000 |
| `data/train-000204.parquet` | 250,000 |
| `data/train-000205.parquet` | 250,000 |
| `data/train-000206.parquet` | 250,000 |
| `data/train-000207.parquet` | 250,000 |
| `data/train-000208.parquet` | 250,000 |
| `data/train-000209.parquet` | 250,000 |
| `data/train-000210.parquet` | 250,000 |
| `data/train-000211.parquet` | 250,000 |
| `data/train-000212.parquet` | 250,000 |
| `data/train-000213.parquet` | 250,000 |
| `data/train-000214.parquet` | 250,000 |
| `data/train-000215.parquet` | 250,000 |
| `data/train-000216.parquet` | 250,000 |
| `data/train-000217.parquet` | 250,000 |
| `data/train-000218.parquet` | 250,000 |
| `data/train-000219.parquet` | 250,000 |
| `data/train-000220.parquet` | 250,000 |
| `data/train-000221.parquet` | 250,000 |
| `data/train-000222.parquet` | 250,000 |
| `data/train-000223.parquet` | 250,000 |
| `data/train-000224.parquet` | 250,000 |
| `data/train-000225.parquet` | 250,000 |
| `data/train-000226.parquet` | 250,000 |
| `data/train-000227.parquet` | 250,000 |
| `data/train-000228.parquet` | 250,000 |
| `data/train-000229.parquet` | 250,000 |
| `data/train-000230.parquet` | 250,000 |
| `data/train-000231.parquet` | 250,000 |
| `data/train-000232.parquet` | 250,000 |
| `data/train-000233.parquet` | 250,000 |
| `data/train-000234.parquet` | 250,000 |
| `data/train-000235.parquet` | 250,000 |
| `data/train-000236.parquet` | 250,000 |
| `data/train-000237.parquet` | 250,000 |
| `data/train-000238.parquet` | 250,000 |
| `data/train-000239.parquet` | 250,000 |
| `data/train-000240.parquet` | 250,000 |
| `data/train-000241.parquet` | 250,000 |
| `data/train-000242.parquet` | 250,000 |
| `data/train-000243.parquet` | 250,000 |
| `data/train-000244.parquet` | 250,000 |
| `data/train-000245.parquet` | 250,000 |
| `data/train-000246.parquet` | 250,000 |
| `data/train-000247.parquet` | 250,000 |
| `data/train-000248.parquet` | 250,000 |
| `data/train-000249.parquet` | 250,000 |
| `data/train-000250.parquet` | 250,000 |
| `data/train-000251.parquet` | 250,000 |
| `data/train-000252.parquet` | 250,000 |
| `data/train-000253.parquet` | 250,000 |
| `data/train-000254.parquet` | 250,000 |
| `data/train-000255.parquet` | 250,000 |
| `data/train-000256.parquet` | 250,000 |
| `data/train-000257.parquet` | 250,000 |
| `data/train-000258.parquet` | 250,000 |
| `data/train-000259.parquet` | 250,000 |
| `data/train-000260.parquet` | 250,000 |
| `data/train-000261.parquet` | 250,000 |
| `data/train-000262.parquet` | 250,000 |
| `data/train-000263.parquet` | 250,000 |
| `data/train-000264.parquet` | 250,000 |
| `data/train-000265.parquet` | 250,000 |
| `data/train-000266.parquet` | 250,000 |
| `data/train-000267.parquet` | 250,000 |
| `data/train-000268.parquet` | 250,000 |
| `data/train-000269.parquet` | 250,000 |
| `data/train-000270.parquet` | 250,000 |
| `data/train-000271.parquet` | 187,789 |
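
The shard table follows directly from the final row count and the target shard size. The sketch below reproduces that arithmetic and the file naming seen in the table; the six-digit zero-padded pattern is inferred from the listed paths, and the helper names are hypothetical.

```python
from typing import List, Tuple

TOTAL_ROWS = 67_937_789      # final row count from this card
TARGET_SHARD_ROWS = 250_000  # target shard size from this card


def shard_path(index: int) -> str:
    """Path of shard `index`, following the naming pattern in the table."""
    return f"data/train-{index:06d}.parquet"


def shard_layout(total: int, target: int) -> List[Tuple[str, int]]:
    """Return (path, rows) pairs: full shards followed by one partial shard, if any."""
    full, remainder = divmod(total, target)
    layout = [(shard_path(i), target) for i in range(full)]
    if remainder:
        layout.append((shard_path(full), remainder))
    return layout


layout = shard_layout(TOTAL_ROWS, TARGET_SHARD_ROWS)
assert len(layout) == 272
assert layout[-1] == ("data/train-000271.parquet", 187_789)
assert sum(rows for _, rows in layout) == TOTAL_ROWS
```

That is 271 full shards of 250,000 rows plus a final shard holding the remaining 187,789 rows.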
Provider:
Ba2han



