
spdenisov/tokenized_udtree

Hugging Face. Updated 2023-03-28; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/spdenisov/tokenized_udtree
Resource description:
---
dataset_info:
  features:
  - name: input_ids
    sequence: int32
  - name: attention_mask
    sequence: int8
  splits:
  - {name: cs_0, num_bytes: 73985244, num_examples: 102133}
  - {name: cs_1, num_bytes: 95459594, num_examples: 102133}
  - {name: cs_2, num_bytes: 95354064, num_examples: 102133}
  - {name: cs_3, num_bytes: 128817619, num_examples: 102133}
  - {name: cs_4, num_bytes: 236925044, num_examples: 102133}
  - {name: cs_5, num_bytes: 115688159, num_examples: 102133}
  - {name: cs_6, num_bytes: 132404489, num_examples: 102133}
  - {name: tr_0, num_bytes: 28666902, num_examples: 60089}
  - {name: tr_1, num_bytes: 31887742, num_examples: 60089}
  - {name: tr_2, num_bytes: 31749302, num_examples: 60089}
  - {name: tr_3, num_bytes: 28498032, num_examples: 60089}
  - {name: tr_4, num_bytes: 57177672, num_examples: 60089}
  - {name: tr_5, num_bytes: 37804587, num_examples: 60089}
  - {name: tr_6, num_bytes: 28280762, num_examples: 60089}
  - {name: ar_0, num_bytes: 32848442, num_examples: 21864}
  - {name: ar_1, num_bytes: 49955197, num_examples: 21864}
  - {name: ar_2, num_bytes: 49285292, num_examples: 21864}
  - {name: ar_3, num_bytes: 69585617, num_examples: 21864}
  - {name: ar_4, num_bytes: 91649737, num_examples: 21864}
  - {name: ar_5, num_bytes: 59303592, num_examples: 21864}
  - {name: ar_6, num_bytes: 50935047, num_examples: 21864}
  - {name: de_0, num_bytes: 112997417, num_examples: 166849}
  - {name: de_1, num_bytes: 149332477, num_examples: 166849}
  - {name: de_2, num_bytes: 157628127, num_examples: 166849}
  - {name: de_3, num_bytes: 155444887, num_examples: 166849}
  - {name: de_4, num_bytes: 309419752, num_examples: 166849}
  - {name: de_5, num_bytes: 191783977, num_examples: 166849}
  - {name: de_6, num_bytes: 138689312, num_examples: 166849}
  - {name: fr_0, num_bytes: 27905013, num_examples: 34921}
  - {name: fr_1, num_bytes: 41237113, num_examples: 34921}
  - {name: fr_2, num_bytes: 45655098, num_examples: 34921}
  - {name: fr_3, num_bytes: 39973853, num_examples: 34921}
  - {name: fr_4, num_bytes: 76420558, num_examples: 34921}
  - {name: fr_5, num_bytes: 56197173, num_examples: 34921}
  - {name: fr_6, num_bytes: 39938223, num_examples: 34921}
  - {name: no_0, num_bytes: 19584526, num_examples: 33282}
  - {name: no_1, num_bytes: 25823376, num_examples: 33282}
  - {name: no_2, num_bytes: 26954416, num_examples: 33282}
  - {name: no_3, num_bytes: 23459636, num_examples: 33282}
  - {name: no_4, num_bytes: 43762856, num_examples: 33282}
  - {name: no_5, num_bytes: 32578281, num_examples: 33282}
  - {name: no_6, num_bytes: 23459636, num_examples: 33282}
  - {name: pt_0, num_bytes: 12627085, num_examples: 30720}
  - {name: pt_1, num_bytes: 16475005, num_examples: 30720}
  - {name: pt_2, num_bytes: 17295815, num_examples: 30720}
  - {name: pt_3, num_bytes: 16917200, num_examples: 30720}
  - {name: pt_4, num_bytes: 24168495, num_examples: 30720}
  - {name: pt_5, num_bytes: 20520155, num_examples: 30720}
  - {name: pt_6, num_bytes: 15115165, num_examples: 30720}
  - {name: es_0, num_bytes: 27551907, num_examples: 28474}
  - {name: es_1, num_bytes: 39391152, num_examples: 28474}
  - {name: es_2, num_bytes: 42349787, num_examples: 28474}
  - {name: es_3, num_bytes: 43743597, num_examples: 28474}
  - {name: es_4, num_bytes: 69878787, num_examples: 28474}
  - {name: es_5, num_bytes: 51203677, num_examples: 28474}
  - {name: es_6, num_bytes: 46914367, num_examples: 28474}
  - {name: ru_0, num_bytes: 57566900, num_examples: 89525}
  - {name: ru_1, num_bytes: 74853550, num_examples: 89525}
  - {name: ru_2, num_bytes: 76555950, num_examples: 89525}
  - {name: ru_3, num_bytes: 67072565, num_examples: 89525}
  - {name: ru_4, num_bytes: 155012405, num_examples: 89525}
  - {name: ru_5, num_bytes: 92396515, num_examples: 89525}
  - {name: ru_6, num_bytes: 98333345, num_examples: 89525}
  - {name: en_0, num_bytes: 14945668, num_examples: 28686}
  - {name: en_1, num_bytes: 20836733, num_examples: 28686}
  - {name: en_2, num_bytes: 23313373, num_examples: 28686}
  - {name: en_3, num_bytes: 21978133, num_examples: 28686}
  - {name: en_4, num_bytes: 32732303, num_examples: 28686}
  - {name: en_5, num_bytes: 28539183, num_examples: 28686}
  - {name: en_6, num_bytes: 28399343, num_examples: 28686}
  - {name: fi_0, num_bytes: 14729969, num_examples: 27198}
  - {name: fi_1, num_bytes: 17656509, num_examples: 27198}
  - {name: fi_2, num_bytes: 16915489, num_examples: 27198}
  - {name: fi_3, num_bytes: 18732354, num_examples: 27198}
  - {name: fi_4, num_bytes: 29894674, num_examples: 27198}
  - {name: fi_5, num_bytes: 20079089, num_examples: 27198}
  - {name: fi_6, num_bytes: 18874279, num_examples: 27198}
  - {name: gd_0, num_bytes: 2829948, num_examples: 3541}
  - {name: gd_1, num_bytes: 3700318, num_examples: 3541}
  - {name: gd_2, num_bytes: 3798313, num_examples: 3541}
  - {name: gd_3, num_bytes: 3907648, num_examples: 3541}
  - {name: gd_4, num_bytes: 5359963, num_examples: 3541}
  - {name: gd_5, num_bytes: 4693368, num_examples: 3541}
  - {name: gd_6, num_bytes: 3383253, num_examples: 3541}
  - {name: gv_0, num_bytes: 456221, num_examples: 1172}
  - {name: gv_1, num_bytes: 597391, num_examples: 1172}
  - {name: gv_2, num_bytes: 609501, num_examples: 1172}
  - {name: gv_3, num_bytes: 542486, num_examples: 1172}
  - {name: gv_4, num_bytes: 785231, num_examples: 1172}
  - {name: gv_5, num_bytes: 729026, num_examples: 1172}
  - {name: gv_6, num_bytes: 542486, num_examples: 1172}
  - {name: ga_0, num_bytes: 3928820, num_examples: 4005}
  - {name: ga_1, num_bytes: 5021230, num_examples: 4005}
  - {name: ga_2, num_bytes: 5059580, num_examples: 4005}
  - {name: ga_3, num_bytes: 4843745, num_examples: 4005}
  - {name: ga_4, num_bytes: 9085760, num_examples: 4005}
  - {name: ga_5, num_bytes: 6197075, num_examples: 4005}
  - {name: ga_6, num_bytes: 4483365, num_examples: 4005}
  - {name: cop_0, num_bytes: 4660032, num_examples: 1379}
  - {name: cop_1, num_bytes: 5726842, num_examples: 1379}
  - {name: cop_2, num_bytes: 4508942, num_examples: 1379}
  - {name: cop_3, num_bytes: 4496787, num_examples: 1379}
  - {name: cop_4, num_bytes: 5425137, num_examples: 1379}
  - {name: cop_5, num_bytes: 4907442, num_examples: 1379}
  - {name: cop_6, num_bytes: 4284382, num_examples: 1379}
  - {name: it_0, num_bytes: 17989232, num_examples: 21724}
  - {name: it_1, num_bytes: 25839627, num_examples: 21724}
  - {name: it_2, num_bytes: 27448052, num_examples: 21724}
  - {name: it_3, num_bytes: 24875027, num_examples: 21724}
  - {name: it_4, num_bytes: 43731272, num_examples: 21724}
  - {name: it_5, num_bytes: 33091747, num_examples: 21724}
  - {name: it_6, num_bytes: 30955017, num_examples: 21724}
  - {name: cy_0, num_bytes: 907518, num_examples: 1111}
  - {name: cy_1, num_bytes: 1180383, num_examples: 1111}
  - {name: cy_2, num_bytes: 1192068, num_examples: 1111}
  - {name: cy_3, num_bytes: 1123428, num_examples: 1111}
  - {name: cy_4, num_bytes: 1834888, num_examples: 1111}
  - {name: cy_5, num_bytes: 1439843, num_examples: 1111}
  - {name: cy_6, num_bytes: 1055223, num_examples: 1111}
  - {name: hu_0, num_bytes: 858340, num_examples: 910}
  - {name: hu_1, num_bytes: 1088085, num_examples: 910}
  - {name: hu_2, num_bytes: 1086220, num_examples: 910}
  - {name: hu_3, num_bytes: 957490, num_examples: 910}
  - {name: hu_4, num_bytes: 1964920, num_examples: 910}
  - {name: hu_5, num_bytes: 1370660, num_examples: 910}
  - {name: hu_6, num_bytes: 957490, num_examples: 910}
  - {name: zh_0, num_bytes: 9051347, num_examples: 7994}
  - {name: zh_1, num_bytes: 12537582, num_examples: 7994}
  - {name: zh_2, num_bytes: 11419717, num_examples: 7994}
  - {name: zh_3, num_bytes: 10888407, num_examples: 7994}
  - {name: zh_4, num_bytes: 10558847, num_examples: 7994}
  - {name: zh_5, num_bytes: 13867342, num_examples: 7994}
  - {name: zh_6, num_bytes: 10167967, num_examples: 7994}
  - {name: hy_0, num_bytes: 5120790, num_examples: 3200}
  - {name: hy_1, num_bytes: 5762195, num_examples: 3200}
  - {name: hy_2, num_bytes: 4712195, num_examples: 3200}
  - {name: hy_3, num_bytes: 4260805, num_examples: 3200}
  - {name: hy_4, num_bytes: 8546900, num_examples: 3200}
  - {name: hy_5, num_bytes: 5442440, num_examples: 3200}
  - {name: hy_6, num_bytes: 4260805, num_examples: 3200}
  - {name: ro_0, num_bytes: 6894274, num_examples: 8043}
  - {name: ro_1, num_bytes: 9156564, num_examples: 8043}
  - {name: ro_2, num_bytes: 9493574, num_examples: 8043}
  - {name: ro_3, num_bytes: 10830604, num_examples: 8043}
  - {name: ro_4, num_bytes: 20320209, num_examples: 8043}
  - {name: ro_5, num_bytes: 11507314, num_examples: 8043}
  - {name: ro_6, num_bytes: 8300564, num_examples: 8043}
  - {name: da_0, num_bytes: 2963139, num_examples: 4383}
  - {name: da_1, num_bytes: 3945104, num_examples: 4383}
  - {name: da_2, num_bytes: 4115634, num_examples: 4383}
  - {name: da_3, num_bytes: 3583269, num_examples: 4383}
  - {name: da_4, num_bytes: 7089004, num_examples: 4383}
  - {name: da_5, num_bytes: 4981724, num_examples: 4383}
  - {name: da_6, num_bytes: 3583269, num_examples: 4383}
  - {name: nl_0, num_bytes: 6741817, num_examples: 12289}
  - {name: nl_1, num_bytes: 8989392, num_examples: 12289}
  - {name: nl_2, num_bytes: 9389757, num_examples: 12289}
  - {name: nl_3, num_bytes: 16004832, num_examples: 12289}
  - {name: nl_4, num_bytes: 12089687, num_examples: 12289}
  - {name: nl_5, num_bytes: 11410547, num_examples: 12289}
  - {name: nl_6, num_bytes: 12631912, num_examples: 12289}
  download_size: 934434422
  dataset_size: 5264208717
---
# Dataset Card for "tokenized_udtree"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
Provider:
spdenisov
Summary of original information

Dataset overview

Dataset name

"tokenized_udtree"

Dataset size

  • Download size: 934434422 bytes
  • Dataset size: 5264208717 bytes
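
To put the two byte counts above in more familiar units, the short sketch below converts them to GiB and computes their ratio. It assumes the usual Hugging Face convention that the download size refers to the stored files and the dataset size to the decoded data; the helper name `to_gib` is made up for this illustration.

```python
# Byte counts copied from the listing above.
DOWNLOAD_BYTES = 934434422
DATASET_BYTES = 5264208717


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to binary gigabytes (GiB)."""
    return num_bytes / 2**30


# Fraction of the full dataset size that has to be downloaded.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

if __name__ == "__main__":
    print(f"download: {to_gib(DOWNLOAD_BYTES):.2f} GiB")
    print(f"dataset:  {to_gib(DATASET_BYTES):.2f} GiB")
    print(f"ratio:    {download_ratio:.3f}")
```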

Dataset features

  • input_ids: sequence of int32
  • attention_mask: sequence of int8
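
A minimal sketch of how one split might be loaded and a record checked against the two features listed above. It assumes the Hugging Face `datasets` library is installed and that the repository is reachable under the ID `spdenisov/tokenized_udtree`. `load_split` and `check_record` are hypothetical helper names, and the equal-length check is an assumption about tokenized data rather than something the card states.

```python
REPO_ID = "spdenisov/tokenized_udtree"
EXPECTED_FIELDS = ("input_ids", "attention_mask")


def check_record(record: dict) -> bool:
    """Return True if both fields are present as equal-length integer lists."""
    if any(field not in record for field in EXPECTED_FIELDS):
        return False
    ids, mask = record["input_ids"], record["attention_mask"]
    same_length = len(ids) == len(mask)  # assumption, not stated in the card
    all_ints = all(isinstance(v, int) for v in list(ids) + list(mask))
    return same_length and all_ints


def load_split(split: str):
    """Load one named split such as "cs_0" (needs network access)."""
    # Imported lazily so that check_record can be used without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)


# Usage (requires network access and the `datasets` package):
#   ds = load_split("cs_0")
#   print(ds.features)
#   print(check_record(ds[0]))
```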

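Dataset splits

The dataset covers multiple languages, each with several splits; every split has a fixed number of examples and a fixed size in bytes. Split information for some of the languages follows: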

  • cs (Czech):

    • cs_0: 102133 examples, 73985244 bytes
    • cs_1: 102133 examples, 95459594 bytes
    • cs_2: 102133 examples, 95354064 bytes
    • cs_3: 102133 examples, 128817619 bytes
    • cs_4: 102133 examples, 236925044 bytes
    • cs_5: 102133 examples, 115688159 bytes
    • cs_6: 102133 examples, 132404489 bytes
  • tr (Turkish):

    • tr_0: 60089 examples, 28666902 bytes
    • tr_1: 60089 examples, 31887742 bytes
    • tr_2: 60089 examples, 31749302 bytes
    • tr_3: 60089 examples, 28498032 bytes
    • tr_4: 60089 examples, 57177672 bytes
    • tr_5: 60089 examples, 37804587 bytes
    • tr_6: 60089 examples, 28280762 bytes
  • ar (Arabic):

    • ar_0: 21864 examples, 32848442 bytes
    • ar_1: 21864 examples, 49955197 bytes
    • ar_2: 21864 examples, 49285292 bytes
    • ar_3: 21864 examples, 69585617 bytes
    • ar_4: 21864 examples, 91649737 bytes
    • ar_5: 21864 examples, 59303592 bytes
    • ar_6: 21864 examples, 50935047 bytes
  • de (German):

    • de_0: 166849 examples, 112997417 bytes
    • de_1: 166849 examples, 149332477 bytes
    • de_2: 166849 examples, 157628127 bytes
    • de_3: 166849 examples, 155444887 bytes
    • de_4: 166849 examples, 309419752 bytes
    • de_5: 166849 examples, 191783977 bytes
    • de_6: 166849 examples, 138689312 bytes
  • fr (French):

    • fr_0: 34921 examples, 27905013 bytes
    • fr_1: 34921 examples, 41237113 bytes
    • fr_2: 34921 examples, 45655098 bytes
    • fr_3: 34921 examples, 39973853 bytes
    • fr_4: 34921 examples, 76420558 bytes
    • fr_5: 34921 examples, 56197173 bytes
    • fr_6: 34921 examples, 39938223 bytes
  • no (Norwegian):

    • no_0: 33282 examples, 19584526 bytes
    • no_1: 33282 examples, 25823376 bytes
    • no_2: 33282 examples, 26954416 bytes
    • no_3: 33282 examples, 23459636 bytes
    • no_4: 33282 examples, 43762856 bytes
    • no_5: 33282 examples, 32578281 bytes
    • no_6: 33282 examples, 23459636 bytes
  • pt (Portuguese):

    • pt_0: 30720 examples, 12627085 bytes
    • pt_1: 30720 examples, 16475005 bytes
    • pt_2: 30720 examples, 17295815 bytes
    • pt_3: 30720 examples, 16917200 bytes
    • pt_4: 30720 examples, 24168495 bytes
    • pt_5: 30720 examples, 20520155 bytes
    • pt_6: 30720 examples, 15115165 bytes
  • es (Spanish):

    • es_0: 28474 examples, 27551907 bytes
    • es_1: 28474 examples, 39391152 bytes
    • es_2: 28474 examples, 42349787 bytes
    • es_3: 28474 examples, 43743597 bytes
    • es_4: 28474 examples, 69878787 bytes
    • es_5: 28474 examples, 51203677 bytes
    • es_6: 28474 examples, 46914367 bytes
  • ru (Russian):

    • ru_0: 89525 examples, 57566900 bytes
    • ru_1: 89525 examples, 74853550 bytes
    • ru_2: 89525 examples, 76555950 bytes
    • ru_3: 89525 examples, 67072565 bytes
    • ru_4: 89525 examples, 155012405 bytes
    • ru_5: 89525 examples, 92396515 bytes
    • ru_6: 89525 examples, 98333345 bytes
  • en (English):

    • en_0: 28686 examples, 14945668 bytes
    • en_1: 28686 examples, 20836733 bytes
    • en_2: 28686 examples, 23313373 bytes
    • en_3: 28686 examples, 21978133 bytes
    • en_4: 28686 examples, 32732303 bytes
    • en_5: 28686 examples, 28539183 bytes
    • en_6: 28686 examples, 28399343 bytes
  • fi (Finnish):

    • fi_0: 27198 examples, 14729969 bytes
    • fi_1: 27198 examples, 17656509 bytes
    • fi_2: 27198 examples, 16915489 bytes
    • fi_3: 27198 examples, 18732354 bytes
    • fi_4: 27198 examples, 29894674 bytes
    • fi_5: 27198 examples, 20079089 bytes
    • fi_6: 27198 examples, 18874279 bytes
  • gd (Scottish Gaelic):

    • gd_0: 3541 examples, 2829948 bytes
    • gd_1: 3541 examples, 3700318 bytes
    • gd_2: 3541 examples, 3798313 bytes
    • gd_3: 3541 examples, 3907648 bytes
    • gd_4: 3541 examples, 5359963 bytes
    • gd_5: 3541 examples, 4693368 bytes
    • gd_6: 3541 examples, 3383253 bytes
  • gv (Manx):

    • gv_0: 1172 examples, 456221 bytes
    • gv_1: 1172 examples, 597391 bytes
    • gv_2: 1172 examples, 609501 bytes
    • gv_3: 1172 examples, 542486 bytes
    • gv_4: 1172 examples, 785231 bytes
    • gv_5: 1172 examples, 729026 bytes
    • gv_6: 1172 examples, 542486 bytes
  • ga (Irish):

    • ga_0: 4005 examples, 3928820 bytes
    • ga_1: 4005 examples, 5021230 bytes
    • ga_2: 4005 examples, 5059580 bytes
    • ga_3: 4005 examples, 4843745 bytes
    • ga_4: 4005 examples, 9085760 bytes
    • ga_5: 4005 examples, 6197075 bytes
    • ga_6: 4005 examples, 4483365 bytes
  • cop (Coptic):

    • cop_0: 1379 examples, 4660032 bytes
    • cop_1: 1379 examples, 5726842 bytes
    • cop_2: 1379 examples, 4508942 bytes
    • cop_3: 1379 examples, 4496787 bytes
    • cop_4: 1379 examples, 5425137 bytes
    • cop_5: 1379 examples, 4907442 bytes
    • cop_6: 1379 examples, 4284382 bytes
  • it (Italian):

    • it_0: 21724 examples, 17989232 bytes
    • it_1: 21724 examples, 25839627 bytes
    • it_2: 21724 examples, 27448052 bytes
    • it_3: 21724 examples, 24875027 bytes
    • it_4: 21724 examples, 43731272 bytes
    • it_5: 21724 examples, 33091747 bytes
    • it_6: 21724 examples, 30955017 bytes
  • cy (Welsh):

    • cy_0: 1111 examples, 907518 bytes
    • cy_1: 1111 examples, 1180383 bytes
    • cy_2: 1111 examples, 1192068 bytes
    • cy_3: 1111 examples, 1123428 bytes
    • cy_4: 1111 examples, 1834888 bytes
    • cy_5: 1111 examples, 1439843 bytes
    • cy_6: 1111 examples, 1055223 bytes
  • hu (Hungarian):

    • hu_0: 910 examples, 858340 bytes
    • hu_1: 910 examples, 1088085 bytes
    • hu_2: 910 examples, 1086220 bytes
    • hu_3: 910 examples, 957490 bytes
    • hu_4: 910 examples, 1964920 bytes
    • hu_5: 910 examples, 1370660 bytes
    • hu_6: 910 examples, 957490 bytes
  • zh (Chinese):

    • zh_0: 7994 examples, 9051347 bytes
    • zh
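
The per-split figures listed above are easier to compare as average bytes per example. The sketch below does this for a handful of splits copied from the list. `parse_split` and `bytes_per_example` are illustrative helpers, and treating the numeric suffix as a variant index within one language is an assumption, since the card does not explain what the suffix means.

```python
# (num_examples, num_bytes) for a few splits, copied from the list above.
SPLITS = {
    "cs_0": (102133, 73985244),
    "cs_4": (102133, 236925044),
    "tr_0": (60089, 28666902),
    "tr_4": (60089, 57177672),
    "cop_0": (1379, 4660032),
}


def parse_split(name: str) -> tuple[str, int]:
    """Split a name such as "cop_0" into its language code and numeric suffix."""
    lang, _, index = name.rpartition("_")
    return lang, int(index)


def bytes_per_example(name: str) -> float:
    """Average stored size of one example in the given split."""
    num_examples, num_bytes = SPLITS[name]
    return num_bytes / num_examples


if __name__ == "__main__":
    for name in SPLITS:
        lang, index = parse_split(name)
        print(f"{lang:>3} #{index}: {bytes_per_example(name):8.1f} bytes/example")
```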