
Ba2han/tokenized-corpus-0603

Hugging Face | Updated 2026-03-07 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Ba2han/tokenized-corpus-0603
Description:
---
dataset_info:
  features:
  - name: input_ids
    sequence: int32
  - name: attention_mask
    sequence: int32
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Tokenized Corpus Statistics

| Shard | Total Tokens | Avg Tokens | Median Tokens |
|-------|--------------|------------|---------------|
| train-00000-of-00077 | 64,973,050 | 259.89 | 213.0 |
| train-00001-of-00077 | 65,024,877 | 260.10 | 214.0 |
| train-00002-of-00077 | 64,976,757 | 259.91 | 213.0 |
| train-00003-of-00077 | 65,035,795 | 260.14 | 214.0 |
| train-00004-of-00077 | 64,962,798 | 259.85 | 214.0 |
| train-00005-of-00077 | 64,945,077 | 259.78 | 213.0 |
| train-00006-of-00077 | 64,885,161 | 259.54 | 213.0 |
| train-00007-of-00077 | 65,060,639 | 260.24 | 214.0 |
| train-00008-of-00077 | 64,905,435 | 259.62 | 213.0 |
| train-00009-of-00077 | 65,040,770 | 260.16 | 214.0 |
| train-00010-of-00077 | 64,912,129 | 259.65 | 213.0 |
| train-00011-of-00077 | 64,974,978 | 259.90 | 214.0 |
| train-00012-of-00077 | 64,973,804 | 259.90 | 213.0 |
| train-00013-of-00077 | 64,913,694 | 259.65 | 213.0 |
| train-00014-of-00077 | 64,906,173 | 259.62 | 213.0 |
| train-00015-of-00077 | 64,947,254 | 259.79 | 214.0 |
| train-00016-of-00077 | 64,808,665 | 259.23 | 212.0 |
| train-00017-of-00077 | 64,772,849 | 259.09 | 212.0 |
| train-00018-of-00077 | 65,005,488 | 260.02 | 214.0 |
| train-00019-of-00077 | 65,010,312 | 260.04 | 214.0 |
| train-00020-of-00077 | 64,976,998 | 259.91 | 213.0 |
| train-00021-of-00077 | 64,988,846 | 259.96 | 214.0 |
| train-00022-of-00077 | 64,950,899 | 259.80 | 214.0 |
| train-00023-of-00077 | 64,956,439 | 259.83 | 213.0 |
| train-00024-of-00077 | 65,146,074 | 260.58 | 214.0 |
| train-00025-of-00077 | 65,174,108 | 260.70 | 214.0 |
| train-00026-of-00077 | 64,939,341 | 259.76 | 213.0 |
| train-00027-of-00077 | 65,156,021 | 260.62 | 214.0 |
| train-00028-of-00077 | 64,775,060 | 259.10 | 213.0 |
| train-00029-of-00077 | 64,961,138 | 259.84 | 213.0 |
| train-00030-of-00077 | 64,988,300 | 259.95 | 213.0 |
| train-00031-of-00077 | 64,945,869 | 259.78 | 213.0 |
| train-00032-of-00077 | 64,893,235 | 259.57 | 213.0 |
| train-00033-of-00077 | 64,927,042 | 259.71 | 214.0 |
| train-00034-of-00077 | 65,101,375 | 260.41 | 214.0 |
| train-00035-of-00077 | 65,031,864 | 260.13 | 214.0 |
| train-00036-of-00077 | 65,111,482 | 260.45 | 214.0 |
| train-00037-of-00077 | 64,860,636 | 259.44 | 213.0 |
| train-00038-of-00077 | 65,087,344 | 260.35 | 214.0 |
| train-00039-of-00077 | 65,038,176 | 260.15 | 214.0 |
| train-00040-of-00077 | 65,062,299 | 260.25 | 214.0 |
| train-00041-of-00077 | 64,868,891 | 259.48 | 213.0 |
| train-00042-of-00077 | 65,139,940 | 260.56 | 214.0 |
| train-00043-of-00077 | 65,003,795 | 260.02 | 213.0 |
| train-00044-of-00077 | 65,138,572 | 260.55 | 215.0 |
| train-00045-of-00077 | 65,021,642 | 260.09 | 214.0 |
| train-00046-of-00077 | 65,014,977 | 260.06 | 214.0 |
| train-00047-of-00077 | 64,943,326 | 259.77 | 213.0 |
| train-00048-of-00077 | 64,953,491 | 259.81 | 214.0 |
| train-00049-of-00077 | 64,863,204 | 259.45 | 213.0 |
| train-00050-of-00077 | 64,741,302 | 258.97 | 213.0 |
| train-00051-of-00077 | 65,125,371 | 260.50 | 214.0 |
| train-00052-of-00077 | 64,968,804 | 259.88 | 214.0 |
| train-00053-of-00077 | 65,125,921 | 260.50 | 214.0 |
| train-00054-of-00077 | 64,974,042 | 259.90 | 214.0 |
| train-00055-of-00077 | 64,990,107 | 259.96 | 213.0 |
| train-00056-of-00077 | 64,826,847 | 259.31 | 213.0 |
| train-00057-of-00077 | 64,877,379 | 259.51 | 213.0 |
| train-00058-of-00077 | 64,888,841 | 259.56 | 213.0 |
| train-00059-of-00077 | 65,076,924 | 260.31 | 213.0 |
| train-00060-of-00077 | 64,894,418 | 259.58 | 213.0 |
| train-00061-of-00077 | 65,030,084 | 260.12 | 214.0 |
| train-00062-of-00077 | 64,883,522 | 259.53 | 213.0 |
| train-00063-of-00077 | 64,996,849 | 259.99 | 213.0 |
| train-00064-of-00077 | 64,954,207 | 259.82 | 213.0 |
| train-00065-of-00077 | 64,868,166 | 259.47 | 213.0 |
| train-00066-of-00077 | 64,961,856 | 259.85 | 213.0 |
| train-00067-of-00077 | 65,027,897 | 260.11 | 213.0 |
| train-00068-of-00077 | 64,969,800 | 259.88 | 214.0 |
| train-00069-of-00077 | 65,051,706 | 260.21 | 214.0 |
| train-00070-of-00077 | 64,987,045 | 259.95 | 214.0 |
| train-00071-of-00077 | 64,952,257 | 259.81 | 213.0 |
| train-00072-of-00077 | 64,979,962 | 259.92 | 213.0 |
| train-00073-of-00077 | 65,128,788 | 260.52 | 214.0 |
| train-00074-of-00077 | 65,107,412 | 260.43 | 214.0 |
| train-00075-of-00077 | 65,089,811 | 260.36 | 214.0 |
| train-00076-of-00077 | 6,154,579 | 260.77 | 217.0 |
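
The card does not say how the three statistics columns were produced. As a minimal sketch, assuming each record's token count is simply the length of its `input_ids` sequence, per-shard figures of the same shape could be computed as follows. The helper name `shard_stats` and the toy records are hypothetical and only mirror the declared schema (`input_ids`, `attention_mask`).

```python
from statistics import mean, median


def shard_stats(rows):
    """Return (total_tokens, avg_tokens, median_tokens) for an iterable of records.

    Assumption: a record's token count is len(record["input_ids"]).
    The dataset card does not confirm this definition.
    """
    lengths = [len(row["input_ids"]) for row in rows]
    return sum(lengths), round(mean(lengths), 2), float(median(lengths))


# Toy records following the declared schema; the token ids are made up.
toy_rows = [
    {"input_ids": [101, 7, 8, 102], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [101, 9, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 5, 6, 7, 8, 102], "attention_mask": [1] * 6},
]

total, avg, med = shard_stats(toy_rows)
print(total, avg, med)  # prints: 13 4.33 4.0
```

To try this on the real data, the same function could be fed records from the Hugging Face `datasets` library, for example `load_dataset("Ba2han/tokenized-corpus-0603", split="train", streaming=True)`, assuming the repository is reachable from your environment.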

Provider:
Ba2han