spdenisov/tokenized_udtrees_trunc
Source: Hugging Face. Updated: 2023-03-30. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/spdenisov/tokenized_udtrees_trunc
Description:
---
dataset_info:
features:
- name: input_ids
sequence: int32
- name: labels
sequence: int64
- name: attention_mask
sequence: int8
- name: length
dtype: int64
splits:
- name: fr_0
num_bytes: 72813504
num_examples: 34912
- name: fr_1
num_bytes: 106992505
num_examples: 34884
- name: fr_2
num_bytes: 118066880
num_examples: 34858
- name: fr_3
num_bytes: 103747628
num_examples: 34886
- name: fr_4
num_bytes: 179954204
num_examples: 33724
- name: fr_5
num_bytes: 142682805
num_examples: 34681
- name: fr_6
num_bytes: 103669700
num_examples: 34887
- name: ar_0
num_bytes: 76392970
num_examples: 21341
- name: ar_1
num_bytes: 99682724
num_examples: 20211
- name: ar_2
num_bytes: 104828728
num_examples: 20561
- name: ar_3
num_bytes: 120387755
num_examples: 18591
- name: ar_4
num_bytes: 110845444
num_examples: 15239
- name: ar_5
num_bytes: 113333216
num_examples: 19622
- name: ar_6
num_bytes: 97966198
num_examples: 20004
- name: nl_0
num_bytes: 17678650
num_examples: 12289
- name: nl_1
num_bytes: 23522345
num_examples: 12289
- name: nl_2
num_bytes: 24563294
num_examples: 12289
- name: nl_3
num_bytes: 41551823
num_examples: 12274
- name: nl_4
num_bytes: 31583112
num_examples: 12289
- name: nl_5
num_bytes: 29817348
num_examples: 12289
- name: nl_6
num_bytes: 32965583
num_examples: 12287
- name: de_0
num_bytes: 295802185
num_examples: 166848
- name: de_1
num_bytes: 390229614
num_examples: 166845
- name: de_2
num_bytes: 411788885
num_examples: 166844
- name: de_3
num_bytes: 406127223
num_examples: 166845
- name: de_4
num_bytes: 794559733
num_examples: 166061
- name: de_5
num_bytes: 500383319
num_examples: 166830
- name: de_6
num_bytes: 362580545
num_examples: 166846
- name: ru_0
num_bytes: 150571543
num_examples: 89515
- name: ru_1
num_bytes: 195170653
num_examples: 89496
- name: ru_2
num_bytes: 199557398
num_examples: 89494
- name: ru_3
num_bytes: 175089824
num_examples: 89505
- name: ru_4
num_bytes: 385862504
num_examples: 88402
- name: ru_5
num_bytes: 239909307
num_examples: 89442
- name: ru_6
num_bytes: 254396827
num_examples: 89380
- name: pt_0
num_bytes: 33205205
num_examples: 30720
- name: pt_1
num_bytes: 43209797
num_examples: 30720
- name: pt_2
num_bytes: 45343903
num_examples: 30720
- name: pt_3
num_bytes: 44359504
num_examples: 30720
- name: pt_4
num_bytes: 63212871
num_examples: 30720
- name: pt_5
num_bytes: 53727187
num_examples: 30720
- name: pt_6
num_bytes: 39674213
num_examples: 30720
- name: ro_0
num_bytes: 17993349
num_examples: 8041
- name: ro_1
num_bytes: 23770442
num_examples: 8035
- name: ro_2
num_bytes: 24600913
num_examples: 8032
- name: ro_3
num_bytes: 27929669
num_examples: 8023
- name: ro_4
num_bytes: 48677219
num_examples: 7799
- name: ro_5
num_bytes: 29549023
num_examples: 8015
- name: ro_6
num_bytes: 21594484
num_examples: 8038
- name: hy_0
num_bytes: 12162343
num_examples: 3129
- name: hy_1
num_bytes: 13197354
num_examples: 3096
- name: hy_2
num_bytes: 11443297
num_examples: 3149
- name: hy_3
num_bytes: 10501791
num_examples: 3161
- name: hy_4
num_bytes: 16496323
num_examples: 2884
- name: hy_5
num_bytes: 12602551
num_examples: 3107
- name: hy_6
num_bytes: 10501791
num_examples: 3161
- name: en_0
num_bytes: 39190941
num_examples: 28685
- name: en_1
num_bytes: 54446758
num_examples: 28682
- name: en_2
num_bytes: 60866411
num_examples: 28681
- name: en_3
num_bytes: 57413241
num_examples: 28682
- name: en_4
num_bytes: 84543655
num_examples: 28628
- name: en_5
num_bytes: 73953982
num_examples: 28648
- name: en_6
num_bytes: 73215142
num_examples: 28626
- name: hu_0
num_bytes: 2242786
num_examples: 910
- name: hu_1
num_bytes: 2840123
num_examples: 910
- name: hu_2
num_bytes: 2835274
num_examples: 910
- name: hu_3
num_bytes: 2500576
num_examples: 910
- name: hu_4
num_bytes: 4799115
num_examples: 889
- name: hu_5
num_bytes: 3547088
num_examples: 908
- name: hu_6
num_bytes: 2500576
num_examples: 910
- name: tr_0
num_bytes: 75249383
num_examples: 60088
- name: tr_1
num_bytes: 83604892
num_examples: 60087
- name: tr_2
num_bytes: 83243895
num_examples: 60087
- name: tr_3
num_bytes: 74806746
num_examples: 60088
- name: tr_4
num_bytes: 148074211
num_examples: 60006
- name: tr_5
num_bytes: 98925962
num_examples: 60083
- name: tr_6
num_bytes: 74242806
num_examples: 60088
- name: it_0
num_bytes: 46804518
num_examples: 21711
- name: it_1
num_bytes: 66265256
num_examples: 21655
- name: it_2
num_bytes: 70151753
num_examples: 21637
- name: it_3
num_bytes: 63960323
num_examples: 21667
- name: it_4
num_bytes: 100412869
num_examples: 20900
- name: it_5
num_bytes: 82319403
num_examples: 21483
- name: it_6
num_bytes: 77655835
num_examples: 21535
- name: fi_0
num_bytes: 38406525
num_examples: 27185
- name: fi_1
num_bytes: 45852915
num_examples: 27178
- name: fi_2
num_bytes: 43964919
num_examples: 27179
- name: fi_3
num_bytes: 48780830
num_examples: 27184
- name: fi_4
num_bytes: 76447425
num_examples: 27109
- name: fi_5
num_bytes: 51991381
num_examples: 27170
- name: fi_6
num_bytes: 48559262
num_examples: 27153
- name: fa_0
num_bytes: 96243585
num_examples: 30906
- name: fa_1
num_bytes: 113502571
num_examples: 30784
- name: fa_2
num_bytes: 97058237
num_examples: 30894
- name: fa_3
num_bytes: 107038686
num_examples: 30851
- name: fa_4
num_bytes: 112125942
num_examples: 30822
- name: fa_5
num_bytes: 113077898
num_examples: 30767
- name: fa_6
num_bytes: 88091064
num_examples: 30932
- name: gd_0
num_bytes: 7335465
num_examples: 3537
- name: gd_1
num_bytes: 9467949
num_examples: 3530
- name: gd_2
num_bytes: 9689767
num_examples: 3528
- name: gd_3
num_bytes: 9926268
num_examples: 3525
- name: gd_4
num_bytes: 12713464
num_examples: 3465
- name: gd_5
num_bytes: 11546562
num_examples: 3499
- name: gd_6
num_bytes: 8709089
num_examples: 3534
- name: cy_0
num_bytes: 2373101
num_examples: 1111
- name: cy_1
num_bytes: 3082550
num_examples: 1111
- name: cy_2
num_bytes: 3112931
num_examples: 1111
- name: cy_3
num_bytes: 2934467
num_examples: 1111
- name: cy_4
num_bytes: 4784263
num_examples: 1111
- name: cy_5
num_bytes: 3757146
num_examples: 1111
- name: cy_6
num_bytes: 2757134
num_examples: 1111
- name: cs_0
num_bytes: 193204789
num_examples: 102111
- name: cs_1
num_bytes: 248532815
num_examples: 102085
- name: cs_2
num_bytes: 248265366
num_examples: 102085
- name: cs_3
num_bytes: 332530755
num_examples: 101916
- name: cs_4
num_bytes: 537663964
num_examples: 97317
- name: cs_5
num_bytes: 299610164
num_examples: 101990
- name: cs_6
num_bytes: 339589731
num_examples: 101777
- name: es_0
num_bytes: 71968866
num_examples: 28473
- name: es_1
num_bytes: 102260411
num_examples: 28443
- name: es_2
num_bytes: 109651662
num_examples: 28424
- name: es_3
num_bytes: 112979119
num_examples: 28404
- name: es_4
num_bytes: 163186080
num_examples: 27271
- name: es_5
num_bytes: 130959590
num_examples: 28317
- name: es_6
num_bytes: 119790214
num_examples: 28310
- name: zh_0
num_bytes: 23617606
num_examples: 7993
- name: zh_1
num_bytes: 32483372
num_examples: 7980
- name: zh_2
num_bytes: 29697463
num_examples: 7988
- name: zh_3
num_bytes: 28332743
num_examples: 7989
- name: zh_4
num_bytes: 27491845
num_examples: 7990
- name: zh_5
num_bytes: 35551944
num_examples: 7954
- name: zh_6
num_bytes: 26490384
num_examples: 7991
- name: no_0
num_bytes: 51325808
num_examples: 33282
- name: no_1
num_bytes: 67531367
num_examples: 33281
- name: no_2
num_bytes: 70471135
num_examples: 33281
- name: no_3
num_bytes: 61386787
num_examples: 33281
- name: no_4
num_bytes: 113337815
num_examples: 33227
- name: no_5
num_bytes: 84988095
num_examples: 33274
- name: no_6
num_bytes: 61386787
num_examples: 33281
- name: ga_0
num_bytes: 10164126
num_examples: 4000
- name: ga_1
num_bytes: 12904387
num_examples: 3995
- name: ga_2
num_bytes: 13000600
num_examples: 3995
- name: ga_3
num_bytes: 12458429
num_examples: 3996
- name: ga_4
num_bytes: 22263032
num_examples: 3924
- name: ga_5
num_bytes: 15711892
num_examples: 3980
- name: ga_6
num_bytes: 11531217
num_examples: 3996
- name: da_0
num_bytes: 7757634
num_examples: 4383
- name: da_1
num_bytes: 10310743
num_examples: 4383
- name: da_2
num_bytes: 10754121
num_examples: 4383
- name: da_3
num_bytes: 9369972
num_examples: 4383
- name: da_4
num_bytes: 17982417
num_examples: 4351
- name: da_5
num_bytes: 12936123
num_examples: 4378
- name: da_6
num_bytes: 9369972
num_examples: 4383
- name: cop_0
num_bytes: 7622435
num_examples: 1122
- name: cop_1
num_bytes: 7185677
num_examples: 972
- name: cop_2
num_bytes: 7618669
num_examples: 1143
- name: cop_3
num_bytes: 7622440
num_examples: 1145
- name: cop_4
num_bytes: 7298153
num_examples: 1011
- name: cop_5
num_bytes: 7482224
num_examples: 1084
- name: cop_6
num_bytes: 7630235
num_examples: 1174
- name: gv_0
num_bytes: 1200473
num_examples: 1172
- name: gv_1
num_bytes: 1567515
num_examples: 1172
- name: gv_2
num_bytes: 1599001
num_examples: 1172
- name: gv_3
num_bytes: 1424762
num_examples: 1172
- name: gv_4
num_bytes: 2042489
num_examples: 1171
- name: gv_5
num_bytes: 1881763
num_examples: 1170
- name: gv_6
num_bytes: 1424762
num_examples: 1172
download_size: 1339506450
dataset_size: 13867176061
---
# Dataset Card for "tokenized_udtrees_trunc"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
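
Since the card itself gives no usage notes, here is a minimal sketch of how one split might be loaded with the Hugging Face `datasets` library. The repository id and the split names come from the metadata above; the helper names `split_name` and `load_split` are hypothetical, and what the indices 0 to 6 distinguish is not documented here.

```python
REPO_ID = "spdenisov/tokenized_udtrees_trunc"  # repository id from the page header


def split_name(lang: str, variant: int) -> str:
    """Build a split name such as 'fr_0' from a language prefix and an index 0..6."""
    if not 0 <= variant <= 6:
        raise ValueError(f"variant must be in 0..6, got {variant}")
    return f"{lang}_{variant}"


def load_split(lang: str, variant: int):
    """Load a single split; requires the `datasets` package and network access."""
    # Imported lazily so that split_name() stays usable without the dependency.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split_name(lang, variant))


# Example call (not executed here because it downloads data):
# ds = load_split("cy", 0)   # cy_0 is listed with 1111 examples
```

Loading one small split first is a cheap way to inspect the columns before pulling the larger German or Czech splits.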
Provider:
spdenisov
Summary of the original information
Dataset overview
Dataset name
"tokenized_udtrees_trunc"
Dataset size
- Download size: 1339506450 bytes
- Dataset size: 13867176061 bytes
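
To put the two byte counts in perspective, the following sketch converts them to GiB and computes their ratio. It assumes the usual Hugging Face metadata convention, in which the download size refers to the files fetched and the dataset size to the data once materialized; the card does not state this explicitly, and the names below are illustrative.

```python
# Byte counts copied from the dataset metadata.
DOWNLOAD_SIZE = 1_339_506_450
DATASET_SIZE = 13_867_176_061


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


# How many times larger the dataset is than the download.
ratio = DATASET_SIZE / DOWNLOAD_SIZE

print(f"download size: {to_gib(DOWNLOAD_SIZE):.2f} GiB")
print(f"dataset size:  {to_gib(DATASET_SIZE):.2f} GiB")
print(f"ratio:         {ratio:.1f}x")
```

The ratio comes out a little above ten, which is worth keeping in mind when budgeting disk space.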
Dataset features
- input_ids: sequence of int32
- labels: sequence of int64
- attention_mask: sequence of int8
- length: int64
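
The feature list above can be read as a small schema. The sketch below checks a plain Python dictionary against it using only the standard library; `SCHEMA`, `check_record`, and the example record (including its token ids) are hypothetical and only illustrate the declared types.

```python
from typing import Any, Dict, List

# Value ranges of the integer dtypes named in the feature list.
INT_RANGES = {
    "int8": (-2**7, 2**7 - 1),
    "int32": (-2**31, 2**31 - 1),
    "int64": (-2**63, 2**63 - 1),
}

# Field name -> (is_sequence, dtype), transcribed from the feature list.
SCHEMA = {
    "input_ids": (True, "int32"),
    "labels": (True, "int64"),
    "attention_mask": (True, "int8"),
    "length": (False, "int64"),
}


def check_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the record fits."""
    problems = []
    for field, (is_seq, dtype) in SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if is_seq and not isinstance(value, list):
            problems.append(f"{field}: expected a list")
            continue
        lo, hi = INT_RANGES[dtype]
        for v in value if is_seq else [value]:
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(v, bool) or not isinstance(v, int) or not lo <= v <= hi:
                problems.append(f"{field}: {v!r} is not a valid {dtype}")
                break
    return problems


# Made-up record for illustration; real token ids will differ.
example = {
    "input_ids": [5, 17, 42],
    "labels": [5, 17, 42],
    "attention_mask": [1, 1, 1],
    "length": 3,
}
print(check_record(example))
```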
Dataset splits
The dataset contains multiple per-language splits, each with its own byte count and number of examples. The first three splits of each language are listed below as examples:
- fr_0: 72813504 bytes, 34912 examples
- fr_1: 106992505 bytes, 34884 examples
- fr_2: 118066880 bytes, 34858 examples
- ar_0: 76392970 bytes, 21341 examples
- ar_1: 99682724 bytes, 20211 examples
- ar_2: 104828728 bytes, 20561 examples
- nl_0: 17678650 bytes, 12289 examples
- nl_1: 23522345 bytes, 12289 examples
- nl_2: 24563294 bytes, 12289 examples
- de_0: 295802185 bytes, 166848 examples
- de_1: 390229614 bytes, 166845 examples
- de_2: 411788885 bytes, 166844 examples
- ru_0: 150571543 bytes, 89515 examples
- ru_1: 195170653 bytes, 89496 examples
- ru_2: 199557398 bytes, 89494 examples
- pt_0: 33205205 bytes, 30720 examples
- pt_1: 43209797 bytes, 30720 examples
- pt_2: 45343903 bytes, 30720 examples
- ro_0: 17993349 bytes, 8041 examples
- ro_1: 23770442 bytes, 8035 examples
- ro_2: 24600913 bytes, 8032 examples
- hy_0: 12162343 bytes, 3129 examples
- hy_1: 13197354 bytes, 3096 examples
- hy_2: 11443297 bytes, 3149 examples
- en_0: 39190941 bytes, 28685 examples
- en_1: 54446758 bytes, 28682 examples
- en_2: 60866411 bytes, 28681 examples
- hu_0: 2242786 bytes, 910 examples
- hu_1: 2840123 bytes, 910 examples
- hu_2: 2835274 bytes, 910 examples
- tr_0: 75249383 bytes, 60088 examples
- tr_1: 83604892 bytes, 60087 examples
- tr_2: 83243895 bytes, 60087 examples
- it_0: 46804518 bytes, 21711 examples
- it_1: 66265256 bytes, 21655 examples
- it_2: 70151753 bytes, 21637 examples
- fi_0: 38406525 bytes, 27185 examples
- fi_1: 45852915 bytes, 27178 examples
- fi_2: 43964919 bytes, 27179 examples
- fa_0: 96243585 bytes, 30906 examples
- fa_1: 113502571 bytes, 30784 examples
- fa_2: 97058237 bytes, 30894 examples
- gd_0: 7335465 bytes, 3537 examples
- gd_1: 9467949 bytes, 3530 examples
- gd_2: 9689767 bytes, 3528 examples
- cy_0: 2373101 bytes, 1111 examples
- cy_1: 3082550 bytes, 1111 examples
- cy_2: 3112931 bytes, 1111 examples
- cs_0: 193204789 bytes, 102111 examples
- cs_1: 248532815 bytes, 102085 examples
- cs_2: 248265366 bytes, 102085 examples
- es_0: 71968866 bytes, 28473 examples
- es_1: 102260411 bytes, 28443 examples
- es_2: 109651662 bytes, 28424 examples
- zh_0: 23617606 bytes, 7993 examples
- zh_1: 32483372 bytes, 7980 examples
- zh_2: 29697463 bytes, 7988 examples
- no_0: 51325808 bytes, 33282 examples
- no_1: 67531367 bytes, 33281 examples
- no_2: 70471135 bytes, 33281 examples
- ga_0: 10164126 bytes, 4000 examples
- ga_1: 12904387 bytes, 3995 examples
- ga_2: 13000600 bytes, 3995 examples
- da_0: 7757634 bytes, 4383 examples
- da_1: 10310743 bytes, 4383 examples
- da_2: 10754121 bytes, 4383 examples
- cop_0: 7622435 bytes, 1122 examples
- cop_1: 7185677 bytes, 972 examples
- cop_2: 7618669 bytes, 1143 examples
- gv_0: 1200473 bytes, 1172 examples
- gv_1: 1567515 bytes, 1172 examples
- gv_2: 1599001 bytes, 1172 examples
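
The split names in the list follow a `<prefix>_<index>` pattern, with a two- or three-letter prefix and an index from 0 to 6. The sketch below parses that pattern and derives the average bytes per example for a few splits whose figures are copied from the list. `parse_split` and `SPLITS` are illustrative names, and reading the prefix as a language code is an assumption based on the summary's wording.

```python
import re
from typing import Tuple

# Two- or three-letter lowercase prefix, underscore, single digit 0..6.
SPLIT_RE = re.compile(r"^([a-z]{2,3})_([0-6])$")


def parse_split(name: str) -> Tuple[str, int]:
    """Split a name like 'cop_3' into its prefix and numeric index."""
    match = SPLIT_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected split name: {name!r}")
    return match.group(1), int(match.group(2))


# (num_bytes, num_examples) for a few splits, copied from the list.
SPLITS = {
    "fr_0": (72813504, 34912),
    "de_0": (295802185, 166848),
    "cop_0": (7622435, 1122),
    "gv_0": (1200473, 1172),
}

for name, (num_bytes, num_examples) in SPLITS.items():
    prefix, index = parse_split(name)
    avg = num_bytes / num_examples
    print(f"{prefix} (index {index}): about {avg:.0f} bytes per example")
```

Bytes per example varies widely between splits, so the example count alone is a poor guide to how much memory a split needs.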



