Ba2han/tokenized-corpus-0603
Source: Hugging Face. Updated 2026-03-07; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Ba2han/tokenized-corpus-0603
Resource description:
---
dataset_info:
  features:
  - name: input_ids
    sequence: int32
  - name: attention_mask
    sequence: int32
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
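
The metadata above declares a single `train` split whose records carry two `int32` sequences, `input_ids` and `attention_mask`. The following is a minimal sketch of how such a split might be streamed with the Hugging Face `datasets` library and how one record could be sanity-checked against that schema. The helper names `stream_train` and `check_record` are hypothetical, and the equal-length check is an assumption based on common tokenizer output, not something the card states.

```python
REPO_ID = "Ba2han/tokenized-corpus-0603"
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def stream_train(repo_id=REPO_ID):
    """Return an iterable over the train split without downloading every shard."""
    # Imported lazily so the rest of this sketch runs without the library.
    from datasets import load_dataset  # pip install datasets

    return load_dataset(repo_id, split="train", streaming=True)


def check_record(record):
    """Return True if a record matches the declared features.

    Assumption: input_ids and attention_mask have the same length,
    as is usual for tokenizer output. The card does not state this.
    """
    ids = list(record["input_ids"])
    mask = list(record["attention_mask"])
    if len(ids) != len(mask):
        return False
    # Every value must be an integer that fits in a signed 32-bit range.
    return all(
        isinstance(v, int) and INT32_MIN <= v <= INT32_MAX
        for v in ids + mask
    )
```

With network access, `next(iter(stream_train()))` would yield one record that can be passed to `check_record`.
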
# Tokenized Corpus Statistics
| Shard | Total Tokens | Avg Tokens | Median Tokens |
|-------|--------------|------------|---------------|
| train-00000-of-00077 | 64,973,050 | 259.89 | 213.0 |
| train-00001-of-00077 | 65,024,877 | 260.10 | 214.0 |
| train-00002-of-00077 | 64,976,757 | 259.91 | 213.0 |
| train-00003-of-00077 | 65,035,795 | 260.14 | 214.0 |
| train-00004-of-00077 | 64,962,798 | 259.85 | 214.0 |
| train-00005-of-00077 | 64,945,077 | 259.78 | 213.0 |
| train-00006-of-00077 | 64,885,161 | 259.54 | 213.0 |
| train-00007-of-00077 | 65,060,639 | 260.24 | 214.0 |
| train-00008-of-00077 | 64,905,435 | 259.62 | 213.0 |
| train-00009-of-00077 | 65,040,770 | 260.16 | 214.0 |
| train-00010-of-00077 | 64,912,129 | 259.65 | 213.0 |
| train-00011-of-00077 | 64,974,978 | 259.90 | 214.0 |
| train-00012-of-00077 | 64,973,804 | 259.90 | 213.0 |
| train-00013-of-00077 | 64,913,694 | 259.65 | 213.0 |
| train-00014-of-00077 | 64,906,173 | 259.62 | 213.0 |
| train-00015-of-00077 | 64,947,254 | 259.79 | 214.0 |
| train-00016-of-00077 | 64,808,665 | 259.23 | 212.0 |
| train-00017-of-00077 | 64,772,849 | 259.09 | 212.0 |
| train-00018-of-00077 | 65,005,488 | 260.02 | 214.0 |
| train-00019-of-00077 | 65,010,312 | 260.04 | 214.0 |
| train-00020-of-00077 | 64,976,998 | 259.91 | 213.0 |
| train-00021-of-00077 | 64,988,846 | 259.96 | 214.0 |
| train-00022-of-00077 | 64,950,899 | 259.80 | 214.0 |
| train-00023-of-00077 | 64,956,439 | 259.83 | 213.0 |
| train-00024-of-00077 | 65,146,074 | 260.58 | 214.0 |
| train-00025-of-00077 | 65,174,108 | 260.70 | 214.0 |
| train-00026-of-00077 | 64,939,341 | 259.76 | 213.0 |
| train-00027-of-00077 | 65,156,021 | 260.62 | 214.0 |
| train-00028-of-00077 | 64,775,060 | 259.10 | 213.0 |
| train-00029-of-00077 | 64,961,138 | 259.84 | 213.0 |
| train-00030-of-00077 | 64,988,300 | 259.95 | 213.0 |
| train-00031-of-00077 | 64,945,869 | 259.78 | 213.0 |
| train-00032-of-00077 | 64,893,235 | 259.57 | 213.0 |
| train-00033-of-00077 | 64,927,042 | 259.71 | 214.0 |
| train-00034-of-00077 | 65,101,375 | 260.41 | 214.0 |
| train-00035-of-00077 | 65,031,864 | 260.13 | 214.0 |
| train-00036-of-00077 | 65,111,482 | 260.45 | 214.0 |
| train-00037-of-00077 | 64,860,636 | 259.44 | 213.0 |
| train-00038-of-00077 | 65,087,344 | 260.35 | 214.0 |
| train-00039-of-00077 | 65,038,176 | 260.15 | 214.0 |
| train-00040-of-00077 | 65,062,299 | 260.25 | 214.0 |
| train-00041-of-00077 | 64,868,891 | 259.48 | 213.0 |
| train-00042-of-00077 | 65,139,940 | 260.56 | 214.0 |
| train-00043-of-00077 | 65,003,795 | 260.02 | 213.0 |
| train-00044-of-00077 | 65,138,572 | 260.55 | 215.0 |
| train-00045-of-00077 | 65,021,642 | 260.09 | 214.0 |
| train-00046-of-00077 | 65,014,977 | 260.06 | 214.0 |
| train-00047-of-00077 | 64,943,326 | 259.77 | 213.0 |
| train-00048-of-00077 | 64,953,491 | 259.81 | 214.0 |
| train-00049-of-00077 | 64,863,204 | 259.45 | 213.0 |
| train-00050-of-00077 | 64,741,302 | 258.97 | 213.0 |
| train-00051-of-00077 | 65,125,371 | 260.50 | 214.0 |
| train-00052-of-00077 | 64,968,804 | 259.88 | 214.0 |
| train-00053-of-00077 | 65,125,921 | 260.50 | 214.0 |
| train-00054-of-00077 | 64,974,042 | 259.90 | 214.0 |
| train-00055-of-00077 | 64,990,107 | 259.96 | 213.0 |
| train-00056-of-00077 | 64,826,847 | 259.31 | 213.0 |
| train-00057-of-00077 | 64,877,379 | 259.51 | 213.0 |
| train-00058-of-00077 | 64,888,841 | 259.56 | 213.0 |
| train-00059-of-00077 | 65,076,924 | 260.31 | 213.0 |
| train-00060-of-00077 | 64,894,418 | 259.58 | 213.0 |
| train-00061-of-00077 | 65,030,084 | 260.12 | 214.0 |
| train-00062-of-00077 | 64,883,522 | 259.53 | 213.0 |
| train-00063-of-00077 | 64,996,849 | 259.99 | 213.0 |
| train-00064-of-00077 | 64,954,207 | 259.82 | 213.0 |
| train-00065-of-00077 | 64,868,166 | 259.47 | 213.0 |
| train-00066-of-00077 | 64,961,856 | 259.85 | 213.0 |
| train-00067-of-00077 | 65,027,897 | 260.11 | 213.0 |
| train-00068-of-00077 | 64,969,800 | 259.88 | 214.0 |
| train-00069-of-00077 | 65,051,706 | 260.21 | 214.0 |
| train-00070-of-00077 | 64,987,045 | 259.95 | 214.0 |
| train-00071-of-00077 | 64,952,257 | 259.81 | 213.0 |
| train-00072-of-00077 | 64,979,962 | 259.92 | 213.0 |
| train-00073-of-00077 | 65,128,788 | 260.52 | 214.0 |
| train-00074-of-00077 | 65,107,412 | 260.43 | 214.0 |
| train-00075-of-00077 | 65,089,811 | 260.36 | 214.0 |
| train-00076-of-00077 | 6,154,579 | 260.77 | 217.0 |
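
The card does not say how the per-shard figures were produced. As a rough sketch, statistics of this shape could be computed from the length of each record's `input_ids`. This assumes that every entry in `input_ids` counts as one token, padding included, and the function names `shard_token_stats`, `format_row`, and `estimate_num_sequences` are hypothetical.

```python
from statistics import median


def shard_token_stats(records):
    """Return (total_tokens, avg_tokens, median_tokens) for an iterable of records."""
    # Assumption: the token count of a record is len(input_ids).
    lengths = [len(r["input_ids"]) for r in records]
    if not lengths:
        raise ValueError("shard contains no records")
    total = sum(lengths)
    return total, round(total / len(lengths), 2), float(median(lengths))


def format_row(shard_name, stats):
    """Render one Markdown table row in the same layout as the table above."""
    total, avg, med = stats
    return f"| {shard_name} | {total:,} | {avg:.2f} | {med:.1f} |"


def estimate_num_sequences(total_tokens, avg_tokens):
    """Estimate the number of sequences in a shard from its total and average."""
    return round(total_tokens / avg_tokens)


# Toy usage with three made-up records of lengths 3, 5, and 10.
toy = [{"input_ids": [0] * n} for n in (3, 5, 10)]
print(format_row("train-toy", shard_token_stats(toy)))
```

Applying `estimate_num_sequences` to the first row (64,973,050 total tokens at an average of 259.89) gives about 250,000 sequences, which suggests that the full shards each hold roughly that many records. The final shard, `train-00076-of-00077`, is much smaller.
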
Provider:
Ba2han



