spdenisov/tokenized_udtree
Source: Hugging Face. Updated 2023-03-28; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/spdenisov/tokenized_udtree
Resource description:
---
dataset_info:
  features:
  - name: input_ids
    sequence: int32
  - name: attention_mask
    sequence: int8
  splits:
  - name: cs_0
    num_bytes: 73985244
    num_examples: 102133
  - name: cs_1
    num_bytes: 95459594
    num_examples: 102133
  - name: cs_2
    num_bytes: 95354064
    num_examples: 102133
  - name: cs_3
    num_bytes: 128817619
    num_examples: 102133
  - name: cs_4
    num_bytes: 236925044
    num_examples: 102133
  - name: cs_5
    num_bytes: 115688159
    num_examples: 102133
  - name: cs_6
    num_bytes: 132404489
    num_examples: 102133
  - name: tr_0
    num_bytes: 28666902
    num_examples: 60089
  - name: tr_1
    num_bytes: 31887742
    num_examples: 60089
  - name: tr_2
    num_bytes: 31749302
    num_examples: 60089
  - name: tr_3
    num_bytes: 28498032
    num_examples: 60089
  - name: tr_4
    num_bytes: 57177672
    num_examples: 60089
  - name: tr_5
    num_bytes: 37804587
    num_examples: 60089
  - name: tr_6
    num_bytes: 28280762
    num_examples: 60089
  - name: ar_0
    num_bytes: 32848442
    num_examples: 21864
  - name: ar_1
    num_bytes: 49955197
    num_examples: 21864
  - name: ar_2
    num_bytes: 49285292
    num_examples: 21864
  - name: ar_3
    num_bytes: 69585617
    num_examples: 21864
  - name: ar_4
    num_bytes: 91649737
    num_examples: 21864
  - name: ar_5
    num_bytes: 59303592
    num_examples: 21864
  - name: ar_6
    num_bytes: 50935047
    num_examples: 21864
  - name: de_0
    num_bytes: 112997417
    num_examples: 166849
  - name: de_1
    num_bytes: 149332477
    num_examples: 166849
  - name: de_2
    num_bytes: 157628127
    num_examples: 166849
  - name: de_3
    num_bytes: 155444887
    num_examples: 166849
  - name: de_4
    num_bytes: 309419752
    num_examples: 166849
  - name: de_5
    num_bytes: 191783977
    num_examples: 166849
  - name: de_6
    num_bytes: 138689312
    num_examples: 166849
  - name: fr_0
    num_bytes: 27905013
    num_examples: 34921
  - name: fr_1
    num_bytes: 41237113
    num_examples: 34921
  - name: fr_2
    num_bytes: 45655098
    num_examples: 34921
  - name: fr_3
    num_bytes: 39973853
    num_examples: 34921
  - name: fr_4
    num_bytes: 76420558
    num_examples: 34921
  - name: fr_5
    num_bytes: 56197173
    num_examples: 34921
  - name: fr_6
    num_bytes: 39938223
    num_examples: 34921
  - name: no_0
    num_bytes: 19584526
    num_examples: 33282
  - name: no_1
    num_bytes: 25823376
    num_examples: 33282
  - name: no_2
    num_bytes: 26954416
    num_examples: 33282
  - name: no_3
    num_bytes: 23459636
    num_examples: 33282
  - name: no_4
    num_bytes: 43762856
    num_examples: 33282
  - name: no_5
    num_bytes: 32578281
    num_examples: 33282
  - name: no_6
    num_bytes: 23459636
    num_examples: 33282
  - name: pt_0
    num_bytes: 12627085
    num_examples: 30720
  - name: pt_1
    num_bytes: 16475005
    num_examples: 30720
  - name: pt_2
    num_bytes: 17295815
    num_examples: 30720
  - name: pt_3
    num_bytes: 16917200
    num_examples: 30720
  - name: pt_4
    num_bytes: 24168495
    num_examples: 30720
  - name: pt_5
    num_bytes: 20520155
    num_examples: 30720
  - name: pt_6
    num_bytes: 15115165
    num_examples: 30720
  - name: es_0
    num_bytes: 27551907
    num_examples: 28474
  - name: es_1
    num_bytes: 39391152
    num_examples: 28474
  - name: es_2
    num_bytes: 42349787
    num_examples: 28474
  - name: es_3
    num_bytes: 43743597
    num_examples: 28474
  - name: es_4
    num_bytes: 69878787
    num_examples: 28474
  - name: es_5
    num_bytes: 51203677
    num_examples: 28474
  - name: es_6
    num_bytes: 46914367
    num_examples: 28474
  - name: ru_0
    num_bytes: 57566900
    num_examples: 89525
  - name: ru_1
    num_bytes: 74853550
    num_examples: 89525
  - name: ru_2
    num_bytes: 76555950
    num_examples: 89525
  - name: ru_3
    num_bytes: 67072565
    num_examples: 89525
  - name: ru_4
    num_bytes: 155012405
    num_examples: 89525
  - name: ru_5
    num_bytes: 92396515
    num_examples: 89525
  - name: ru_6
    num_bytes: 98333345
    num_examples: 89525
  - name: en_0
    num_bytes: 14945668
    num_examples: 28686
  - name: en_1
    num_bytes: 20836733
    num_examples: 28686
  - name: en_2
    num_bytes: 23313373
    num_examples: 28686
  - name: en_3
    num_bytes: 21978133
    num_examples: 28686
  - name: en_4
    num_bytes: 32732303
    num_examples: 28686
  - name: en_5
    num_bytes: 28539183
    num_examples: 28686
  - name: en_6
    num_bytes: 28399343
    num_examples: 28686
  - name: fi_0
    num_bytes: 14729969
    num_examples: 27198
  - name: fi_1
    num_bytes: 17656509
    num_examples: 27198
  - name: fi_2
    num_bytes: 16915489
    num_examples: 27198
  - name: fi_3
    num_bytes: 18732354
    num_examples: 27198
  - name: fi_4
    num_bytes: 29894674
    num_examples: 27198
  - name: fi_5
    num_bytes: 20079089
    num_examples: 27198
  - name: fi_6
    num_bytes: 18874279
    num_examples: 27198
  - name: gd_0
    num_bytes: 2829948
    num_examples: 3541
  - name: gd_1
    num_bytes: 3700318
    num_examples: 3541
  - name: gd_2
    num_bytes: 3798313
    num_examples: 3541
  - name: gd_3
    num_bytes: 3907648
    num_examples: 3541
  - name: gd_4
    num_bytes: 5359963
    num_examples: 3541
  - name: gd_5
    num_bytes: 4693368
    num_examples: 3541
  - name: gd_6
    num_bytes: 3383253
    num_examples: 3541
  - name: gv_0
    num_bytes: 456221
    num_examples: 1172
  - name: gv_1
    num_bytes: 597391
    num_examples: 1172
  - name: gv_2
    num_bytes: 609501
    num_examples: 1172
  - name: gv_3
    num_bytes: 542486
    num_examples: 1172
  - name: gv_4
    num_bytes: 785231
    num_examples: 1172
  - name: gv_5
    num_bytes: 729026
    num_examples: 1172
  - name: gv_6
    num_bytes: 542486
    num_examples: 1172
  - name: ga_0
    num_bytes: 3928820
    num_examples: 4005
  - name: ga_1
    num_bytes: 5021230
    num_examples: 4005
  - name: ga_2
    num_bytes: 5059580
    num_examples: 4005
  - name: ga_3
    num_bytes: 4843745
    num_examples: 4005
  - name: ga_4
    num_bytes: 9085760
    num_examples: 4005
  - name: ga_5
    num_bytes: 6197075
    num_examples: 4005
  - name: ga_6
    num_bytes: 4483365
    num_examples: 4005
  - name: cop_0
    num_bytes: 4660032
    num_examples: 1379
  - name: cop_1
    num_bytes: 5726842
    num_examples: 1379
  - name: cop_2
    num_bytes: 4508942
    num_examples: 1379
  - name: cop_3
    num_bytes: 4496787
    num_examples: 1379
  - name: cop_4
    num_bytes: 5425137
    num_examples: 1379
  - name: cop_5
    num_bytes: 4907442
    num_examples: 1379
  - name: cop_6
    num_bytes: 4284382
    num_examples: 1379
  - name: it_0
    num_bytes: 17989232
    num_examples: 21724
  - name: it_1
    num_bytes: 25839627
    num_examples: 21724
  - name: it_2
    num_bytes: 27448052
    num_examples: 21724
  - name: it_3
    num_bytes: 24875027
    num_examples: 21724
  - name: it_4
    num_bytes: 43731272
    num_examples: 21724
  - name: it_5
    num_bytes: 33091747
    num_examples: 21724
  - name: it_6
    num_bytes: 30955017
    num_examples: 21724
  - name: cy_0
    num_bytes: 907518
    num_examples: 1111
  - name: cy_1
    num_bytes: 1180383
    num_examples: 1111
  - name: cy_2
    num_bytes: 1192068
    num_examples: 1111
  - name: cy_3
    num_bytes: 1123428
    num_examples: 1111
  - name: cy_4
    num_bytes: 1834888
    num_examples: 1111
  - name: cy_5
    num_bytes: 1439843
    num_examples: 1111
  - name: cy_6
    num_bytes: 1055223
    num_examples: 1111
  - name: hu_0
    num_bytes: 858340
    num_examples: 910
  - name: hu_1
    num_bytes: 1088085
    num_examples: 910
  - name: hu_2
    num_bytes: 1086220
    num_examples: 910
  - name: hu_3
    num_bytes: 957490
    num_examples: 910
  - name: hu_4
    num_bytes: 1964920
    num_examples: 910
  - name: hu_5
    num_bytes: 1370660
    num_examples: 910
  - name: hu_6
    num_bytes: 957490
    num_examples: 910
  - name: zh_0
    num_bytes: 9051347
    num_examples: 7994
  - name: zh_1
    num_bytes: 12537582
    num_examples: 7994
  - name: zh_2
    num_bytes: 11419717
    num_examples: 7994
  - name: zh_3
    num_bytes: 10888407
    num_examples: 7994
  - name: zh_4
    num_bytes: 10558847
    num_examples: 7994
  - name: zh_5
    num_bytes: 13867342
    num_examples: 7994
  - name: zh_6
    num_bytes: 10167967
    num_examples: 7994
  - name: hy_0
    num_bytes: 5120790
    num_examples: 3200
  - name: hy_1
    num_bytes: 5762195
    num_examples: 3200
  - name: hy_2
    num_bytes: 4712195
    num_examples: 3200
  - name: hy_3
    num_bytes: 4260805
    num_examples: 3200
  - name: hy_4
    num_bytes: 8546900
    num_examples: 3200
  - name: hy_5
    num_bytes: 5442440
    num_examples: 3200
  - name: hy_6
    num_bytes: 4260805
    num_examples: 3200
  - name: ro_0
    num_bytes: 6894274
    num_examples: 8043
  - name: ro_1
    num_bytes: 9156564
    num_examples: 8043
  - name: ro_2
    num_bytes: 9493574
    num_examples: 8043
  - name: ro_3
    num_bytes: 10830604
    num_examples: 8043
  - name: ro_4
    num_bytes: 20320209
    num_examples: 8043
  - name: ro_5
    num_bytes: 11507314
    num_examples: 8043
  - name: ro_6
    num_bytes: 8300564
    num_examples: 8043
  - name: da_0
    num_bytes: 2963139
    num_examples: 4383
  - name: da_1
    num_bytes: 3945104
    num_examples: 4383
  - name: da_2
    num_bytes: 4115634
    num_examples: 4383
  - name: da_3
    num_bytes: 3583269
    num_examples: 4383
  - name: da_4
    num_bytes: 7089004
    num_examples: 4383
  - name: da_5
    num_bytes: 4981724
    num_examples: 4383
  - name: da_6
    num_bytes: 3583269
    num_examples: 4383
  - name: nl_0
    num_bytes: 6741817
    num_examples: 12289
  - name: nl_1
    num_bytes: 8989392
    num_examples: 12289
  - name: nl_2
    num_bytes: 9389757
    num_examples: 12289
  - name: nl_3
    num_bytes: 16004832
    num_examples: 12289
  - name: nl_4
    num_bytes: 12089687
    num_examples: 12289
  - name: nl_5
    num_bytes: 11410547
    num_examples: 12289
  - name: nl_6
    num_bytes: 12631912
    num_examples: 12289
  download_size: 934434422
  dataset_size: 5264208717
---
# Dataset Card for "tokenized_udtree"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
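
Since the card itself gives no usage notes, here is a minimal sketch of how one split might be loaded with the Hugging Face `datasets` library. The repository name and the split names come from the metadata above. The helper names `split_name` and `load_split` are hypothetical, and the card does not document what the numeric suffix (0 through 6) means, so "variant" is only a placeholder term.

```python
REPO_ID = "spdenisov/tokenized_udtree"  # repository name from this card


def split_name(lang: str, variant: int) -> str:
    """Build a split name such as "cs_0" from a language code and a suffix.

    "variant" is a placeholder term: the card does not say what the
    numeric suffix stands for, only that suffixes 0 through 6 exist.
    """
    if not 0 <= variant <= 6:
        raise ValueError("the card lists suffixes 0 through 6 only")
    return f"{lang}_{variant}"


def load_split(lang: str, variant: int):
    """Load one split from the Hub (needs network access and `datasets`)."""
    # Imported lazily so split_name() stays usable without the package.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split_name(lang, variant))
```

For example, `load_split("gv", 0)` would request the `gv_0` split, which the metadata lists at 1172 examples.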
Provider:
spdenisov
Summary of original information
Dataset overview
Dataset name
"tokenized_udtree"
Dataset size
- Download size: 934434422 bytes
- Dataset size: 5264208717 bytes
Dataset features
- input_ids: sequence of int32
- attention_mask: sequence of int8
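
Each record therefore carries two parallel integer sequences. The card does not say whether the stored sequences are already padded to a common length, nor which tokenizer produced them. Assuming they are variable-length, the following sketch shows one common way to right-pad a batch. The function name `pad_batch`, the padding id `0`, and the token ids are all illustrative assumptions, not values taken from the dataset.

```python
def pad_batch(examples, pad_id=0):
    """Right-pad input_ids and attention_mask to the longest example.

    pad_id=0 is an assumption; the correct value depends on the
    tokenizer, which the card does not name.
    """
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": []}
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        batch["input_ids"].append(list(ex["input_ids"]) + [pad_id] * n_pad)
        # Padding positions get mask value 0 so a model can ignore them.
        batch["attention_mask"].append(list(ex["attention_mask"]) + [0] * n_pad)
    return batch


# Made-up token ids for illustration; not taken from the dataset.
examples = [
    {"input_ids": [101, 7, 42], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 9], "attention_mask": [1, 1]},
]
batch = pad_batch(examples)
```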
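Dataset splits
The dataset covers 23 languages, each with seven splits (suffixes 0 through 6). All seven splits of a language contain the same number of examples but differ in byte size. The splits for some of the languages are listed below: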
- cs (Czech):
  - cs_0: 102133 examples, 73985244 bytes
  - cs_1: 102133 examples, 95459594 bytes
  - cs_2: 102133 examples, 95354064 bytes
  - cs_3: 102133 examples, 128817619 bytes
  - cs_4: 102133 examples, 236925044 bytes
  - cs_5: 102133 examples, 115688159 bytes
  - cs_6: 102133 examples, 132404489 bytes
- tr (Turkish):
  - tr_0: 60089 examples, 28666902 bytes
  - tr_1: 60089 examples, 31887742 bytes
  - tr_2: 60089 examples, 31749302 bytes
  - tr_3: 60089 examples, 28498032 bytes
  - tr_4: 60089 examples, 57177672 bytes
  - tr_5: 60089 examples, 37804587 bytes
  - tr_6: 60089 examples, 28280762 bytes
- ar (Arabic):
  - ar_0: 21864 examples, 32848442 bytes
  - ar_1: 21864 examples, 49955197 bytes
  - ar_2: 21864 examples, 49285292 bytes
  - ar_3: 21864 examples, 69585617 bytes
  - ar_4: 21864 examples, 91649737 bytes
  - ar_5: 21864 examples, 59303592 bytes
  - ar_6: 21864 examples, 50935047 bytes
- de (German):
  - de_0: 166849 examples, 112997417 bytes
  - de_1: 166849 examples, 149332477 bytes
  - de_2: 166849 examples, 157628127 bytes
  - de_3: 166849 examples, 155444887 bytes
  - de_4: 166849 examples, 309419752 bytes
  - de_5: 166849 examples, 191783977 bytes
  - de_6: 166849 examples, 138689312 bytes
- fr (French):
  - fr_0: 34921 examples, 27905013 bytes
  - fr_1: 34921 examples, 41237113 bytes
  - fr_2: 34921 examples, 45655098 bytes
  - fr_3: 34921 examples, 39973853 bytes
  - fr_4: 34921 examples, 76420558 bytes
  - fr_5: 34921 examples, 56197173 bytes
  - fr_6: 34921 examples, 39938223 bytes
- no (Norwegian):
  - no_0: 33282 examples, 19584526 bytes
  - no_1: 33282 examples, 25823376 bytes
  - no_2: 33282 examples, 26954416 bytes
  - no_3: 33282 examples, 23459636 bytes
  - no_4: 33282 examples, 43762856 bytes
  - no_5: 33282 examples, 32578281 bytes
  - no_6: 33282 examples, 23459636 bytes
- pt (Portuguese):
  - pt_0: 30720 examples, 12627085 bytes
  - pt_1: 30720 examples, 16475005 bytes
  - pt_2: 30720 examples, 17295815 bytes
  - pt_3: 30720 examples, 16917200 bytes
  - pt_4: 30720 examples, 24168495 bytes
  - pt_5: 30720 examples, 20520155 bytes
  - pt_6: 30720 examples, 15115165 bytes
- es (Spanish):
  - es_0: 28474 examples, 27551907 bytes
  - es_1: 28474 examples, 39391152 bytes
  - es_2: 28474 examples, 42349787 bytes
  - es_3: 28474 examples, 43743597 bytes
  - es_4: 28474 examples, 69878787 bytes
  - es_5: 28474 examples, 51203677 bytes
  - es_6: 28474 examples, 46914367 bytes
- ru (Russian):
  - ru_0: 89525 examples, 57566900 bytes
  - ru_1: 89525 examples, 74853550 bytes
  - ru_2: 89525 examples, 76555950 bytes
  - ru_3: 89525 examples, 67072565 bytes
  - ru_4: 89525 examples, 155012405 bytes
  - ru_5: 89525 examples, 92396515 bytes
  - ru_6: 89525 examples, 98333345 bytes
- en (English):
  - en_0: 28686 examples, 14945668 bytes
  - en_1: 28686 examples, 20836733 bytes
  - en_2: 28686 examples, 23313373 bytes
  - en_3: 28686 examples, 21978133 bytes
  - en_4: 28686 examples, 32732303 bytes
  - en_5: 28686 examples, 28539183 bytes
  - en_6: 28686 examples, 28399343 bytes
- fi (Finnish):
  - fi_0: 27198 examples, 14729969 bytes
  - fi_1: 27198 examples, 17656509 bytes
  - fi_2: 27198 examples, 16915489 bytes
  - fi_3: 27198 examples, 18732354 bytes
  - fi_4: 27198 examples, 29894674 bytes
  - fi_5: 27198 examples, 20079089 bytes
  - fi_6: 27198 examples, 18874279 bytes
- gd (Scottish Gaelic):
  - gd_0: 3541 examples, 2829948 bytes
  - gd_1: 3541 examples, 3700318 bytes
  - gd_2: 3541 examples, 3798313 bytes
  - gd_3: 3541 examples, 3907648 bytes
  - gd_4: 3541 examples, 5359963 bytes
  - gd_5: 3541 examples, 4693368 bytes
  - gd_6: 3541 examples, 3383253 bytes
- gv (Manx):
  - gv_0: 1172 examples, 456221 bytes
  - gv_1: 1172 examples, 597391 bytes
  - gv_2: 1172 examples, 609501 bytes
  - gv_3: 1172 examples, 542486 bytes
  - gv_4: 1172 examples, 785231 bytes
  - gv_5: 1172 examples, 729026 bytes
  - gv_6: 1172 examples, 542486 bytes
- ga (Irish):
  - ga_0: 4005 examples, 3928820 bytes
  - ga_1: 4005 examples, 5021230 bytes
  - ga_2: 4005 examples, 5059580 bytes
  - ga_3: 4005 examples, 4843745 bytes
  - ga_4: 4005 examples, 9085760 bytes
  - ga_5: 4005 examples, 6197075 bytes
  - ga_6: 4005 examples, 4483365 bytes
- cop (Coptic):
  - cop_0: 1379 examples, 4660032 bytes
  - cop_1: 1379 examples, 5726842 bytes
  - cop_2: 1379 examples, 4508942 bytes
  - cop_3: 1379 examples, 4496787 bytes
  - cop_4: 1379 examples, 5425137 bytes
  - cop_5: 1379 examples, 4907442 bytes
  - cop_6: 1379 examples, 4284382 bytes
- it (Italian):
  - it_0: 21724 examples, 17989232 bytes
  - it_1: 21724 examples, 25839627 bytes
  - it_2: 21724 examples, 27448052 bytes
  - it_3: 21724 examples, 24875027 bytes
  - it_4: 21724 examples, 43731272 bytes
  - it_5: 21724 examples, 33091747 bytes
  - it_6: 21724 examples, 30955017 bytes
- cy (Welsh):
  - cy_0: 1111 examples, 907518 bytes
  - cy_1: 1111 examples, 1180383 bytes
  - cy_2: 1111 examples, 1192068 bytes
  - cy_3: 1111 examples, 1123428 bytes
  - cy_4: 1111 examples, 1834888 bytes
  - cy_5: 1111 examples, 1439843 bytes
  - cy_6: 1111 examples, 1055223 bytes
- hu (Hungarian):
  - hu_0: 910 examples, 858340 bytes
  - hu_1: 910 examples, 1088085 bytes
  - hu_2: 910 examples, 1086220 bytes
  - hu_3: 910 examples, 957490 bytes
  - hu_4: 910 examples, 1964920 bytes
  - hu_5: 910 examples, 1370660 bytes
  - hu_6: 910 examples, 957490 bytes
- zh (Chinese):
  - zh_0: 7994 examples, 9051347 bytes
  - zh
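
Because every split name follows a `<language>_<suffix>` pattern, per-language totals can be derived from the split metadata. The sketch below works under that assumption, using a handful of entries copied from the card. The helper names `parse_split` and `bytes_per_language` are hypothetical. `rpartition` is used rather than a fixed slice because language codes may be two or three letters long (for example `cop`).

```python
from collections import defaultdict

# A few (split name, num_examples, num_bytes) entries copied from the card.
SPLITS = [
    ("cs_0", 102133, 73985244),
    ("cs_1", 102133, 95459594),
    ("gv_0", 1172, 456221),
    ("gv_1", 1172, 597391),
    ("cop_0", 1379, 4660032),
]


def parse_split(name):
    """Split "cop_0" into ("cop", 0) at the last underscore."""
    lang, _, suffix = name.rpartition("_")
    return lang, int(suffix)


def bytes_per_language(splits):
    """Sum num_bytes over the listed splits of each language."""
    totals = defaultdict(int)
    for name, _num_examples, num_bytes in splits:
        lang, _ = parse_split(name)
        totals[lang] += num_bytes
    return dict(totals)


totals = bytes_per_language(SPLITS)
```

Note that `SPLITS` holds only a sample, so `totals` reflects only the entries listed here, not the full size of each language.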



