spdenisov/udtrees
Hugging Face · Updated 2023-03-29 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/spdenisov/udtrees
Resource description:
---
dataset_info:
  features:
  - name: language
    dtype: string
  - name: sentence
    dtype: string
  - name: conllu
    dtype: string
  splits:
  - name: ru_syntagrus_ud_train_a
    num_bytes: 39997721
    num_examples: 24516
  - name: en_ewt_ud_train
    num_bytes: 13066595
    num_examples: 12544
  - name: es_ancora_ud_train
    num_bytes: 41576563
    num_examples: 14287
  - name: ga_idt_ud_train
    num_bytes: 6578580
    num_examples: 4005
  - name: tr_tourism_ud_train
    num_bytes: 5072132
    num_examples: 15476
  - name: ar_nyuad_ud_train
    num_bytes: 46449076
    num_examples: 15789
  - name: cop_scriptorium_ud_train
    num_bytes: 3127527
    num_examples: 1379
  - name: tr_kenet_ud_train
    num_bytes: 9965621
    num_examples: 15398
  - name: ar_padt_ud_train
    num_bytes: 39971051
    num_examples: 6075
  - name: tr_penn_ud_train
    num_bytes: 11428060
    num_examples: 14850
  - name: es_gsd_ud_train
    num_bytes: 22823430
    num_examples: 14187
  - name: fi_tdt_ud_train
    num_bytes: 13228364
    num_examples: 12217
  - name: nl_alpino_ud_train
    num_bytes: 13981525
    num_examples: 12289
  - name: fi_ftb_ud_train
    num_bytes: 10264036
    num_examples: 14981
  - name: ru_syntagrus_ud_train_b
    num_bytes: 42083027
    num_examples: 24298
  - name: no_nynorsk_ud_train
    num_bytes: 14940608
    num_examples: 14174
  - name: de_hdt_ud_train_a_2
    num_bytes: 49150973
    num_examples: 37515
  - name: hu_szeged_ud_train
    num_bytes: 1445467
    num_examples: 910
  - name: cs_pdt_ud_train_l
    num_bytes: 79765505
    num_examples: 41559
  - name: de_hdt_ud_train_a_1
    num_bytes: 50530678
    num_examples: 38102
  - name: tr_boun_ud_train
    num_bytes: 7821321
    num_examples: 7803
  - name: fr_gsd_ud_train
    num_bytes: 22444299
    num_examples: 14450
  - name: no_bokmaal_ud_train
    num_bytes: 14918030
    num_examples: 15696
  - name: fr_partut_ud_train
    num_bytes: 1515774
    num_examples: 803
  - name: de_gsd_ud_train
    num_bytes: 19353463
    num_examples: 13814
  - name: fr_rhapsodie_ud_train
    num_bytes: 1191845
    num_examples: 1288
  - name: en_partut_ud_train
    num_bytes: 2341782
    num_examples: 1781
  - name: cs_cac_ud_train
    num_bytes: 52776214
    num_examples: 23478
  - name: fr_sequoia_ud_train
    num_bytes: 3107869
    num_examples: 2231
  - name: cs_pdt_ud_train_c
    num_bytes: 14988159
    num_examples: 8938
  - name: en_gum_ud_train
    num_bytes: 10299158
    num_examples: 6911
  - name: hy_armtdp_ud_train
    num_bytes: 5096313
    num_examples: 1974
  - name: ru_gsd_ud_train
    num_bytes: 6690467
    num_examples: 3850
  - name: it_parlamint_ud_train
    num_bytes: 641089
    num_examples: 326
  - name: no_nynorsklia_ud_train
    num_bytes: 1951602
    num_examples: 3412
  - name: tr_framenet_ud_train
    num_bytes: 1198915
    num_examples: 2288
  - name: gd_arcosg_ud_train
    num_bytes: 4010492
    num_examples: 3541
  - name: de_hdt_ud_train_b_2
    num_bytes: 51033245
    num_examples: 39007
  - name: it_vit_ud_train
    num_bytes: 14017218
    num_examples: 8277
  - name: zh_gsdsimp_ud_train
    num_bytes: 5375774
    num_examples: 3997
  - name: fr_ftb_ud_train
    num_bytes: 24036178
    num_examples: 14759
  - name: cy_ccg_ud_train
    num_bytes: 1370915
    num_examples: 1111
  - name: de_hdt_ud_train_b_1
    num_bytes: 53015860
    num_examples: 38411
  - name: zh_gsd_ud_train
    num_bytes: 5375739
    num_examples: 3997
  - name: hy_bsut_ud_train
    num_bytes: 2570067
    num_examples: 1226
  - name: fr_parisstories_ud_train
    num_bytes: 1434200
    num_examples: 1390
  - name: gv_cadhan_ud_train
    num_bytes: 547774
    num_examples: 1172
  - name: ro_rrt_ud_train
    num_bytes: 14443371
    num_examples: 8043
  - name: pt_cintil_ud_train
    num_bytes: 19037477
    num_examples: 30720
  - name: ru_taiga_ud_train
    num_bytes: 14956116
    num_examples: 16045
  - name: cs_pdt_ud_train_m
    num_bytes: 20158243
    num_examples: 11180
  - name: tr_atis_ud_train
    num_bytes: 2633984
    num_examples: 4274
  - name: cs_pdt_ud_train_v
    num_bytes: 14454519
    num_examples: 6818
  - name: it_isdt_ud_train
    num_bytes: 19225718
    num_examples: 13121
  - name: ru_syntagrus_ud_train_c
    num_bytes: 30439785
    num_examples: 20816
  - name: cs_fictree_ud_train
    num_bytes: 13380642
    num_examples: 10160
  - name: en_atis_ud_train
    num_bytes: 2524032
    num_examples: 4274
  - name: en_lines_ud_train
    num_bytes: 3264741
    num_examples: 3176
  - name: da_ddt_ud_train
    num_bytes: 5047075
    num_examples: 4383
  - name: fa_seraji_ud_train
    num_bytes: 11517586
    num_examples: 4798
  - name: fa_perdt_ud_train
    num_bytes: 29881906
    num_examples: 26196
  download_size: 335579995
  dataset_size: 1045535496
---
# Dataset Card for "udtrees"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
Provided by:
spdenisov
Summary of original information
Dataset overview
Dataset features
- language: string
- sentence: string
- conllu: string
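To illustrate how the three string fields might be used together, here is a minimal sketch that extracts tokens and dependency arcs from a `conllu` value. It assumes the field holds standard CoNLL-U text (ten tab-separated columns per token line), which the card does not state explicitly. The record shown, including the `language` value, is hypothetical and not taken from the dataset.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: str
    form: str
    upos: str
    head: str
    deprel: str


def parse_conllu(block: str) -> list[Token]:
    """Parse one sentence of CoNLL-U text into a list of syntactic words."""
    tokens = []
    for line in block.splitlines():
        # Skip blank lines and sentence-level comments such as "# text = ...".
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        tokens.append(Token(cols[0], cols[1], cols[3], cols[6], cols[7]))
    return tokens


# Hypothetical record with the three fields listed above.
example = {
    "language": "en",
    "sentence": "Dogs bark.",
    "conllu": (
        "# text = Dogs bark.\n"
        "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
        "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
        "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    ),
}

tokens = parse_conllu(example["conllu"])
for tok in tokens:
    print(tok.id, tok.form, tok.upos, tok.head, tok.deprel)
```

A head value of `0` marks the root of the tree; every other head refers to the `id` of another token in the same sentence.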
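Dataset splits
The dataset contains training sets for multiple languages. Each split records the following:
- Name: the language and the source treebank
- Bytes: the size of the data
- Examples: the number of samples in the split
Dataset statistics
- Download size: 335579995 bytes
- Dataset size: 1045535496 bytes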
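The two byte counts are easier to compare once converted to a common unit. The short sketch below does that arithmetic. Reading the download size as the stored (compressed) files and the dataset size as the decoded data is an assumption based on the usual Hugging Face convention, not something the card states.

```python
DOWNLOAD_SIZE = 335_579_995    # bytes, as listed above
DATASET_SIZE = 1_045_535_496   # bytes, as listed above


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"download: {to_mib(DOWNLOAD_SIZE):.1f} MiB")
print(f"dataset:  {to_mib(DATASET_SIZE):.1f} MiB")
print(f"download / dataset ratio: {ratio:.3f}")
```

The download is roughly a third of the full dataset size.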
Split details
| Name | Bytes | Examples |
|---|---|---|
| ru_syntagrus_ud_train_a | 39997721 | 24516 |
| en_ewt_ud_train | 13066595 | 12544 |
| es_ancora_ud_train | 41576563 | 14287 |
| ga_idt_ud_train | 6578580 | 4005 |
| tr_tourism_ud_train | 5072132 | 15476 |
| ar_nyuad_ud_train | 46449076 | 15789 |
| cop_scriptorium_ud_train | 3127527 | 1379 |
| tr_kenet_ud_train | 9965621 | 15398 |
| ar_padt_ud_train | 39971051 | 6075 |
| tr_penn_ud_train | 11428060 | 14850 |
| es_gsd_ud_train | 22823430 | 14187 |
| fi_tdt_ud_train | 13228364 | 12217 |
| nl_alpino_ud_train | 13981525 | 12289 |
| fi_ftb_ud_train | 10264036 | 14981 |
| ru_syntagrus_ud_train_b | 42083027 | 24298 |
| no_nynorsk_ud_train | 14940608 | 14174 |
| de_hdt_ud_train_a_2 | 49150973 | 37515 |
| hu_szeged_ud_train | 1445467 | 910 |
| cs_pdt_ud_train_l | 79765505 | 41559 |
| de_hdt_ud_train_a_1 | 50530678 | 38102 |
| tr_boun_ud_train | 7821321 | 7803 |
| fr_gsd_ud_train | 22444299 | 14450 |
| no_bokmaal_ud_train | 14918030 | 15696 |
| fr_partut_ud_train | 1515774 | 803 |
| de_gsd_ud_train | 19353463 | 13814 |
| fr_rhapsodie_ud_train | 1191845 | 1288 |
| en_partut_ud_train | 2341782 | 1781 |
| cs_cac_ud_train | 52776214 | 23478 |
| fr_sequoia_ud_train | 3107869 | 2231 |
| cs_pdt_ud_train_c | 14988159 | 8938 |
| en_gum_ud_train | 10299158 | 6911 |
| hy_armtdp_ud_train | 5096313 | 1974 |
| ru_gsd_ud_train | 6690467 | 3850 |
| it_parlamint_ud_train | 641089 | 326 |
| no_nynorsklia_ud_train | 1951602 | 3412 |
| tr_framenet_ud_train | 1198915 | 2288 |
| gd_arcosg_ud_train | 4010492 | 3541 |
| de_hdt_ud_train_b_2 | 51033245 | 39007 |
| it_vit_ud_train | 14017218 | 8277 |
| zh_gsdsimp_ud_train | 5375774 | 3997 |
| fr_ftb_ud_train | 24036178 | 14759 |
| cy_ccg_ud_train | 1370915 | 1111 |
| de_hdt_ud_train_b_1 | 53015860 | 38411 |
| zh_gsd_ud_train | 5375739 | 3997 |
| hy_bsut_ud_train | 2570067 | 1226 |
| fr_parisstories_ud_train | 1434200 | 1390 |
| gv_cadhan_ud_train | 547774 | 1172 |
| ro_rrt_ud_train | 14443371 | 8043 |
| pt_cintil_ud_train | 19037477 | 30720 |
| ru_taiga_ud_train | 14956116 | 16045 |
| cs_pdt_ud_train_m | 20158243 | 11180 |
| tr_atis_ud_train | 2633984 | 4274 |
| cs_pdt_ud_train_v | 14454519 | 6818 |
| it_isdt_ud_train | 19225718 | 13121 |
| ru_syntagrus_ud_train_c | 30439785 | 20816 |
| cs_fictree_ud_train | 13380642 | 10160 |
| en_atis_ud_train | 2524032 | 4274 |
| en_lines_ud_train | 3264741 | 3176 |
| da_ddt_ud_train | 5047075 | 4383 |
| fa_seraji_ud_train | 11517586 | 4798 |
| fa_perdt_ud_train | 29881906 | 26196 |
The table above lists every split in the dataset, with the size and the number of samples of the training set for each language.
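Several treebanks appear under more than one split name (for example `ru_syntagrus_ud_train_a`, `_b`, and `_c`), so per-treebank totals require grouping. The sketch below assumes the names follow the pattern `<language>_<treebank>_ud_train[_<part>]`. That pattern is inferred from the names in the table, not documented by the card, and `parse_split_name` is a hypothetical helper. The rows are a small subset copied from the table.

```python
from collections import defaultdict


def parse_split_name(name: str) -> tuple[str, str, str]:
    """Split e.g. 'ru_syntagrus_ud_train_a' into ('ru', 'syntagrus', 'a')."""
    head, _, tail = name.partition("_ud_train")
    language, _, treebank = head.partition("_")
    return language, treebank, tail.lstrip("_")


# (name, num_bytes, num_examples) for a few rows of the table.
rows = [
    ("ru_syntagrus_ud_train_a", 39997721, 24516),
    ("ru_syntagrus_ud_train_b", 42083027, 24298),
    ("ru_syntagrus_ud_train_c", 30439785, 20816),
    ("de_hdt_ud_train_a_1", 50530678, 38102),
    ("en_ewt_ud_train", 13066595, 12544),
]

# Sum the example counts of all parts that belong to the same treebank.
examples_per_treebank: dict[tuple[str, str], int] = defaultdict(int)
for name, _num_bytes, num_examples in rows:
    language, treebank, _part = parse_split_name(name)
    examples_per_treebank[(language, treebank)] += num_examples

for (language, treebank), total in sorted(examples_per_treebank.items()):
    print(f"{language}/{treebank}: {total} examples")
```

The same grouping applied to the full table would give one total per language and treebank pair.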
Collected summary
Dataset introduction

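Background and challenges
Background overview
This is a multilingual dependency treebank dataset containing about 724,000 rows of text, covering training subsets for Russian, English, Spanish, and many other languages. The data is stored in Parquet format and carries detailed syntactic annotations, which makes it suitable for natural language processing tasks such as dependency parsing and language modeling.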
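As a rough sketch of how one training subset might be loaded, the code below uses the Hugging Face `datasets` library with the repository ID and a split name listed on this page. `load_split` and `short_sentences` are hypothetical helper names, and the length filter is only an illustration. The loader needs network access and the `datasets` package, so it is defined but not called here.

```python
from typing import Iterable, Iterator


def load_split(split: str = "en_ewt_ud_train"):
    """Load one split of spdenisov/udtrees (requires `pip install datasets`)."""
    # Imported lazily so the rest of this module works without the library.
    from datasets import load_dataset

    return load_dataset("spdenisov/udtrees", split=split)


def short_sentences(records: Iterable[dict], max_chars: int = 200) -> Iterator[str]:
    """Yield the `sentence` field of records no longer than `max_chars`."""
    for record in records:
        if len(record["sentence"]) <= max_chars:
            yield record["sentence"]


# Typical use (not executed here because it downloads data):
#   ds = load_split("en_ewt_ud_train")
#   for text in short_sentences(ds.select(range(3))):
#       print(text)
```

Loading a single named split avoids downloading all of the training subsets at once.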
The content above was collected and summarized by 遇见数据集.



