Name: tartuNLP/EstNER
Creator: tartuNLP
Published: 2025-12-01 18:54:21
License: cc-by-4.0

Download link:

https://hf-mirror.com/datasets/tartuNLP/EstNER


Resource description:

---
license: cc-by-4.0
configs:
- config_name: default
  data_files:
  - split: train
    path: estner-*/train/*
  - split: dev
    path: estner-*/dev/*
  - split: test
    path: estner-*/test/*
- config_name: estner-reannotated
  data_files:
  - split: train
    path: estner-reannotated/train/*
  - split: dev
    path: estner-reannotated/dev/*
  - split: test
    path: estner-reannotated/test/*
- config_name: estner-new
  data_files:
  - split: train
    path: estner-new/train/*
  - split: dev
    path: estner-new/dev/*
  - split: test
    path: estner-new/test/*
language:
- et
pretty_name: EstNER
size_categories:
- 10K<n<100K
task_categories:
- token-classification
---

# Estonian Named Entity Recognition (EstNER)

## Dataset Description

The EstNER dataset for named entity recognition in Estonian consists of two parts: _New EstNER_ and _Reannotated EstNER_ (see the corresponding sections of this readme for details). By default, the joint version of the dataset is loaded.

```python
from datasets import load_dataset

# Load the default config, which combines both parts of the dataset.
ds = load_dataset("tartuNLP/EstNER")
```

Each part can also be loaded individually.

```python
from datasets import load_dataset

# Pass the config name as the second argument to load a single part.
new_ds = load_dataset("tartuNLP/EstNER", "estner-new")
reannotated_ds = load_dataset("tartuNLP/EstNER", "estner-reannotated")
```

### New Estonian NER dataset

The dataset is a collection of Estonian news and social media texts annotated with named entities.

#### Dataset statistics

The dataset is divided into training, development and test sets. The annotations can be hierarchical, meaning that one named entity can be nested inside another. The hierarchical annotations have at most three levels.

|                  | Train  | Dev   | Test  | Total  |
|------------------|--------|-------|-------|--------|
| Documents        | 78     | 16    | 15    | 109    |
| Sentences        | 7001   | 882   | 890   | 8773   |
| Tokens           | 111858 | 13130 | 14686 | 139674 |
| 1-level entities | 7480   | 497   | 938   | 8915   |
| 2-level entities | 571    | 44    | 59    | 674    |
| 3-level entities | 27     | 0     | 1     | 28     |

#### Annotated entities

The dataset is annotated with the following entities:

* PER - person names
* GPE - geopolitical entities
* LOC - geographical locations
* ORG - organizations
* PROD - products, things, works of art
* EVENT - events
* DATE - dates
* TIME - times
* TITLE - titles and professions
* MONEY - monetary expressions
* PERCENT - percentages

##### Level 1 entities

|         | Train | Dev | Test | Total |
|---------|-------|-----|------|-------|
| PER     | 2601  | 109 | 299  | 3009  |
| ORG     | 1177  | 85  | 150  | 1412  |
| LOC     | 449   | 31  | 35   | 515   |
| GPE     | 1253  | 129 | 231  | 1613  |
| TITLE   | 702   | 19  | 59   | 772   |
| PROD    | 624   | 60  | 117  | 801   |
| EVENT   | 230   | 15  | 26   | 271   |
| DATE    | 746   | 64  | 77   | 887   |
| TIME    | 103   | 6   | 6    | 115   |
| PERCENT | 75    | 11  | 1    | 87    |
| MONEY   | 118   | 12  | 1    | 131   |
| Total   | 8078  | 541 | 994  | 9613  |

##### Level 2 entities

|         | Train | Dev | Test | Total |
|---------|-------|-----|------|-------|
| PER     | 108   | 1   | 14   | 123   |
| ORG     | 92    | 8   | 6    | 106   |
| LOC     | 25    | 1   | 0    | 26    |
| GPE     | 379   | 35  | 42   | 456   |
| TITLE   | 3     | 0   | 0    | 3     |
| PROD    | 4     | 0   | 0    | 4     |
| EVENT   | 1     | 0   | 0    | 1     |
| DATE    | 10    | 0   | 0    | 10    |
| TIME    | 0     | 0   | 0    | 0     |
| PERCENT | 0     | 0   | 0    | 0     |
| MONEY   | 0     | 0   | 0    | 0     |
| Total   | 622   | 45  | 62   | 729   |

##### Level 3 entities

|         | Train | Dev | Test | Total |
|---------|-------|-----|------|-------|
| PER     | 1     | 0   | 0    | 1     |
| ORG     | 0     | 0   | 0    | 0     |
| LOC     | 1     | 0   | 0    | 1     |
| GPE     | 25    | 0   | 1    | 26    |
| TITLE   | 0     | 0   | 0    | 0     |
| PROD    | 0     | 0   | 0    | 0     |
| EVENT   | 0     | 0   | 0    | 0     |
| DATE    | 0     | 0   | 0    | 0     |
| TIME    | 0     | 0   | 0    | 0     |
| PERCENT | 0     | 0   | 0    | 0     |
| MONEY   | 0     | 0   | 0    | 0     |
| Total   | 27    | 0   | 1    | 28    |

### Reannotated Estonian NER dataset

This is the Estonian NER dataset ([Tkachenko, 2010](https://core.ac.uk/download/pdf/16270382.pdf); [Tkachenko et al., 2013](https://aclanthology.org/W13-2412.pdf)) reannotated with a richer set of entities.

#### Dataset statistics

The dataset is divided into training, development and test sets. The annotations can be hierarchical, meaning that one named entity can be nested inside another. The hierarchical annotations have at most three levels.

|                  | Train  | Dev   | Test  | Total  |
|------------------|--------|-------|-------|--------|
| Documents        | 525    | 18    | 39    | 582    |
| Sentences        | 9965   | 2415  | 1907  | 14287  |
| Tokens           | 155983 | 32890 | 28370 | 217243 |
| 1-level entities | 13918  | 2571  | 2396  | 18885  |
| 2-level entities | 987    | 223   | 122   | 1332   |
| 3-level entities | 40     | 14    | 4     | 58     |

#### Annotated entities

Originally, the Estonian NER dataset was annotated with PER, ORG and LOC entities only. The reannotated version is annotated with the following entities:

* PER - person names
* GPE - geopolitical entities
* LOC - geographical locations
* ORG - organizations
* PROD - products, things, works of art
* EVENT - events
* DATE - dates
* TIME - times
* TITLE - titles and professions
* MONEY - monetary expressions
* PERCENT - percentages

##### Level 1 entities

|         | Train | Dev  | Test | Total |
|---------|-------|------|------|-------|
| PER     | 3563  | 642  | 722  | 4927  |
| ORG     | 3215  | 504  | 541  | 4260  |
| LOC     | 328   | 118  | 61   | 507   |
| GPE     | 3377  | 714  | 479  | 4570  |
| TITLE   | 1302  | 171  | 209  | 1682  |
| PROD    | 874   | 161  | 66   | 1101  |
| EVENT   | 56    | 13   | 17   | 86    |
| DATE    | 1346  | 308  | 186  | 1840  |
| TIME    | 456   | 39   | 30   | 525   |
| PERCENT | 137   | 62   | 58   | 257   |
| MONEY   | 291   | 76   | 153  | 520   |
| Total   | 14945 | 2808 | 2522 | 20275 |

##### Level 2 entities

|         | Train | Dev | Test | Total |
|---------|-------|-----|------|-------|
| PER     | 46    | 7   | 4    | 57    |
| ORG     | 180   | 31  | 12   | 223   |
| LOC     | 58    | 12  | 8    | 78    |
| GPE     | 745   | 160 | 101  | 1006  |
| TITLE   | 6     | 0   | 0    | 6     |
| PROD    | 3     | 0   | 0    | 3     |
| EVENT   | 5     | 0   | 0    | 5     |
| DATE    | 7     | 34  | 1    | 42    |
| TIME    | 0     | 0   | 0    | 0     |
| PERCENT | 1     | 0   | 0    | 1     |
| MONEY   | 0     | 0   | 0    | 0     |
| Total   | 1051  | 244 | 126  | 1421  |

##### Level 3 entities

|         | Train | Dev | Test | Total |
|---------|-------|-----|------|-------|
| PER     | 1     | 0   | 0    | 1     |
| ORG     | 1     | 0   | 0    | 1     |
| LOC     | 0     | 1   | 0    | 1     |
| GPE     | 38    | 13  | 4    | 55    |
| TITLE   | 0     | 0   | 0    | 0     |
| PROD    | 0     | 0   | 0    | 0     |
| EVENT   | 0     | 0   | 0    | 0     |
| DATE    | 0     | 0   | 0    | 0     |
| TIME    | 0     | 0   | 0    | 0     |
| PERCENT | 0     | 0   | 0    | 0     |
| MONEY   | 0     | 0   | 0    | 0     |
| Total   | 40    | 14  | 4    | 58    |

## BibTeX entry and citation info

```
@inproceedings{sirts-2023-estonian,
    title = "{E}stonian Named Entity Recognition: New Datasets and Models",
    author = "Sirts, Kairit",
    editor = {Alum{\"a}e, Tanel and Fishel, Mark},
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.76",
    pages = "752--761",
    abstract = "This paper presents the annotation process of two Estonian named entity recognition (NER) datasets, involving the creation of annotation guidelines for labeling eleven different types of entities. In addition to the commonly annotated entities such as person names, organization names, and locations, the annotation scheme encompasses geopolitical entities, product names, titles/roles, events, dates, times, monetary values, and percents. The annotation was performed on two datasets, one involving reannotating an existing NER dataset primarily composed of news texts and the other incorporating new texts from news and social media domains. Transformer-based models were trained on these annotated datasets to establish baseline predictive performance. Our findings indicate that the best results were achieved by training a single model on the combined dataset, suggesting that the domain differences between the datasets are relatively small.",
}
```
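## Example: counting entities per annotation level

The per-level entity counts reported in the statistics tables could be computed from token-level tags along the following lines. This is a minimal sketch, assuming that each annotation level is stored as its own BIO-tagged sequence aligned with the tokens. That layout, the toy sentence, and the helper names `count_entities` and `count_by_level` are illustrative assumptions and are not taken from the dataset card.

```python
from collections import Counter


def count_entities(tags):
    """Count entities by type in one BIO-tagged sequence.

    An entity starts at a B- tag, or at an I- tag that does not continue
    an entity of the same type (a lenient reading of malformed input).
    """
    counts = Counter()
    prev_type = None  # type of the entity the previous token belonged to
    for tag in tags:
        if tag == "O":
            prev_type = None
            continue
        prefix, _, ent_type = tag.partition("-")
        if prefix == "B" or ent_type != prev_type:
            counts[ent_type] += 1
        prev_type = ent_type
    return counts


def count_by_level(levels):
    """Map each level number (1-based) to the entity counts of that level."""
    return {i: count_entities(tags) for i, tags in enumerate(levels, start=1)}


# Hypothetical sentence: an organization name that contains a nested
# geopolitical entity. The tag layout is an assumption for illustration.
tokens = ["Tartu", "Ülikool", "asub", "Tartus", "."]
level_1 = ["B-ORG", "I-ORG", "O", "B-GPE", "O"]
level_2 = ["B-GPE", "O", "O", "O", "O"]
level_3 = ["O", "O", "O", "O", "O"]

per_level = count_by_level([level_1, level_2, level_3])
for level, counts in per_level.items():
    print(level, dict(counts))
```

Keeping one tag sequence per level means that nested entities never compete for the same tag slot, so each level can be counted or evaluated on its own with the same function.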
