alexbrandsen/archaeo_ner_dutch

Name: alexbrandsen/archaeo_ner_dutch
Creator: alexbrandsen
Published: 2024-01-30 12:41:18
License: no description provided

Source: Hugging Face. Updated: 2024-01-30. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/alexbrandsen/archaeo_ner_dutch

Resource description:

---
language:
- nl
license: other
task_categories:
- token-classification
pretty_name: Dutch Archaeology NER Dataset
license_name: hippocratic-license-3.0
license_link: https://firstdonoharm.dev/version/3/0/full.md
dataset_info:
  features:
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': O
          '1': B-ART
          '2': I-ART
          '3': B-CON
          '4': I-CON
          '5': B-LOC
          '6': I-LOC
          '7': B-MAT
          '8': I-MAT
          '9': B-PER
          '10': I-PER
          '11': B-SPE
          '12': I-SPE
  splits:
  - name: fold1_train
    num_bytes: 4490700
    num_examples: 22150
  - name: fold1_validation
    num_bytes: 1579488
    num_examples: 5852
  - name: fold1_test
    num_bytes: 1574291
    num_examples: 5750
  - name: fold2_train
    num_bytes: 4685070
    num_examples: 22465
  - name: fold2_validation
    num_bytes: 1379777
    num_examples: 5431
  - name: fold2_test
    num_bytes: 1579700
    num_examples: 5865
  - name: fold3_train
    num_bytes: 4762905
    num_examples: 19560
  - name: fold3_validation
    num_bytes: 1501653
    num_examples: 8757
  - name: fold3_test
    num_bytes: 1379769
    num_examples: 5427
  - name: fold4_train
    num_bytes: 4533412
    num_examples: 17029
  - name: fold4_validation
    num_bytes: 1609278
    num_examples: 7963
  - name: fold4_test
    num_bytes: 1501649
    num_examples: 8755
  - name: fold5_train
    num_bytes: 4460910
    num_examples: 20039
  - name: fold5_validation
    num_bytes: 1574155
    num_examples: 5747
  - name: fold5_test
    num_bytes: 1609342
    num_examples: 7965
  download_size: 7478347
  dataset_size: 38222099
configs:
- config_name: default
  data_files:
  - split: fold1_train
    path: data/fold1_train-*
  - split: fold1_validation
    path: data/fold1_validation-*
  - split: fold1_test
    path: data/fold1_test-*
  - split: fold2_train
    path: data/fold2_train-*
  - split: fold2_validation
    path: data/fold2_validation-*
  - split: fold2_test
    path: data/fold2_test-*
  - split: fold3_train
    path: data/fold3_train-*
  - split: fold3_validation
    path: data/fold3_validation-*
  - split: fold3_test
    path: data/fold3_test-*
  - split: fold4_train
    path: data/fold4_train-*
  - split: fold4_validation
    path: data/fold4_validation-*
  - split: fold4_test
    path: data/fold4_test-*
  - split: fold5_train
    path: data/fold5_train-*
  - split: fold5_validation
    path: data/fold5_validation-*
  - split: fold5_test
    path: data/fold5_test-*
tags:
- archaeology
---

# Dutch Archaeology NER Dataset

A selection of Dutch archaeology field reports, annotated by archaeology students from Leiden University.

## Labels

The following labels are included:

- ART, artefacts ('bijl', 'pijlpunt')
- MAT, materials ('vuursteen', 'ijzer')
- PER, time periods ('Middeleeuwen', '400 v. Chr.')
- CON, archaeological contexts ('greppel', 'beerput')
- LOC, locations ('Amsterdam', 'Oss')
- SPE, species ('Betula nana', 'koe')

## Folds

I supply 5 folds for two reasons: I get wildly different F1 scores between folds, and it's important to keep whole documents within a fold. These are long documents, and any document that is split between train and test instantly leads to a higher F1, because the model starts recognising specific tokens as entities, which is a form of overfitting. A micro-averaged F1 over 5 folds with no split documents seems like the fairest evaluation, and the closest to real-world inference.

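As a rough illustration of the micro-averaged F1 mentioned above, the sketch below pools true positives, false positives and false negatives across folds before computing a single F1, and contrasts that with a plain mean of per-fold scores. The `FoldCounts` class, the function names and the counts in the example are hypothetical and are not taken from the dataset or the paper; how entity matches are counted (exact span match or otherwise) is left to whatever evaluation tooling you use.

```python
from dataclasses import dataclass


@dataclass
class FoldCounts:
    """Entity-level counts for one fold's test split (hypothetical container)."""
    tp: int  # predicted entities that match a gold entity
    fp: int  # predicted entities with no matching gold entity
    fn: int  # gold entities the model missed


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(folds: list[FoldCounts]) -> float:
    """Pool counts over all folds, then compute one F1."""
    return f1(
        sum(c.tp for c in folds),
        sum(c.fp for c in folds),
        sum(c.fn for c in folds),
    )


def macro_f1(folds: list[FoldCounts]) -> float:
    """Unweighted mean of per-fold F1 scores, shown for comparison."""
    return sum(f1(c.tp, c.fp, c.fn) for c in folds) / len(folds)


if __name__ == "__main__":
    # Made-up counts for two folds of very different size and difficulty.
    example = [FoldCounts(tp=8, fp=2, fn=2), FoldCounts(tp=1, fp=4, fn=4)]
    print(f"micro F1: {micro_f1(example):.2f}")
    print(f"macro F1: {macro_f1(example):.2f}")
```

Because the counts are pooled, a fold with more entities weighs more heavily in the micro average, which is one reason it can differ from the mean of the per-fold scores.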
### Citation Information

```
@inproceedings{brandsen-etal-2020-creating,
    title = "Creating a Dataset for Named Entity Recognition in the Archaeology Domain",
    author = "Brandsen, Alex and Verberne, Suzan and Wansleeben, Milco and Lambers, Karsten",
    editor = "Calzolari, Nicoletta and B{\'e}chet, Fr{\'e}d{\'e}ric and Blache, Philippe and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, H{\'e}l{\`e}ne and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.562",
    pages = "4573--4577",
    abstract = "In this paper, we present the development of a training dataset for Dutch Named Entity Recognition (NER) in the archaeology domain. This dataset was created as there is a dire need for semantic search within archaeology, in order to allow archaeologists to find structured information in collections of Dutch excavation reports, currently totalling around 60,000 (658 million words) and growing rapidly. To guide this search task, NER is needed. We created rigorous annotation guidelines in an iterative process, then instructed five archaeology students to annotate a number of documents. The resulting dataset contains {\textasciitilde}31k annotations between six entity types (artefact, time period, place, context, species {\&} material). The inter-annotator agreement is 0.95, and when we used this data for machine learning, we observed an increase in F1 score from 0.51 to 0.70 in comparison to a machine learning model trained on a dataset created in prior work. This indicates that the data is of high quality, and can confidently be used to train NER classifiers.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
```

Provided by:

alexbrandsen

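Summary of original information

Dutch Archaeology NER Dataset

Overview

Language: Dutch
License: Other (Hippocratic License 3.0)
Task category: Token classification (named entity recognition)
Dataset name: Dutch Archaeology NER Dataset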

Dataset information

Features:
- tokens: sequence of strings
- ner_tags: sequence of class labels, with the following classes:
  - 0: O
  - 1: B-ART
  - 2: I-ART
  - 3: B-CON
  - 4: I-CON
  - 5: B-LOC
  - 6: I-LOC
  - 7: B-MAT
  - 8: I-MAT
  - 9: B-PER
  - 10: I-PER
  - 11: B-SPE
  - 12: I-SPE
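
The class list above follows a B-/I-/O tagging scheme. As a minimal sketch of how the integer `ner_tags` could be turned back into labelled entity spans, the snippet below uses the label order from this card. The function name `decode_spans`, the example sentence and its tags are made up for illustration, and the choice to treat a stray `I-` tag as the start of a new span is an assumption, not something the dataset prescribes.

```python
# Label order as listed on the dataset card (index = class id).
LABELS = [
    "O",
    "B-ART", "I-ART",
    "B-CON", "I-CON",
    "B-LOC", "I-LOC",
    "B-MAT", "I-MAT",
    "B-PER", "I-PER",
    "B-SPE", "I-SPE",
]


def decode_spans(tokens: list[str], tag_ids: list[int]) -> list[tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity_type, text) pairs."""
    spans: list[tuple[str, str]] = []
    current_type = None
    current_tokens: list[str] = []

    for token, tag_id in zip(tokens, tag_ids):
        tag = LABELS[tag_id]

        # An I- tag of the same type continues the open span.
        if tag.startswith("I-") and tag[2:] == current_type:
            current_tokens.append(token)
            continue

        # Anything else closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []

        # A B- tag (or a stray I- tag, handled leniently) opens a new span.
        if tag != "O":
            current_type = tag[2:]
            current_tokens = [token]

    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


if __name__ == "__main__":
    # Hypothetical sentence, not taken from the dataset.
    tokens = ["Een", "bijl", "van", "vuursteen", "uit", "de", "Late", "Middeleeuwen"]
    tag_ids = [0, 1, 0, 7, 0, 0, 9, 10]
    print(decode_spans(tokens, tag_ids))
```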
Data splits:
- fold1_train: 22150 examples, 4490700 bytes
- fold1_validation: 5852 examples, 1579488 bytes
- fold1_test: 5750 examples, 1574291 bytes
- fold2_train: 22465 examples, 4685070 bytes
- fold2_validation: 5431 examples, 1379777 bytes
- fold2_test: 5865 examples, 1579700 bytes
- fold3_train: 19560 examples, 4762905 bytes
- fold3_validation: 8757 examples, 1501653 bytes
- fold3_test: 5427 examples, 1379769 bytes
- fold4_train: 17029 examples, 4533412 bytes
- fold4_validation: 7963 examples, 1609278 bytes
- fold4_test: 8755 examples, 1501649 bytes
- fold5_train: 20039 examples, 4460910 bytes
- fold5_validation: 5747 examples, 1574155 bytes
- fold5_test: 7965 examples, 1609342 bytes
Dataset size:
- Download size: 7478347 bytes
- Total size: 38222099 bytes
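
To make the split sizes easier to compare, the sketch below sums the example counts per fold and computes each split's share. The counts are copied from the list above; the names `SPLIT_EXAMPLES`, `fold_total` and `split_share` are illustrative. Each fold adds up to between 33,744 and 33,761 examples, while the train share ranges from about 50% (fold 4) to about 67% (fold 2). The card does not say what unit an "example" is, so these shares say nothing about document counts.

```python
# Example counts per split, copied from the dataset card.
SPLIT_EXAMPLES = {
    1: {"train": 22150, "validation": 5852, "test": 5750},
    2: {"train": 22465, "validation": 5431, "test": 5865},
    3: {"train": 19560, "validation": 8757, "test": 5427},
    4: {"train": 17029, "validation": 7963, "test": 8755},
    5: {"train": 20039, "validation": 5747, "test": 7965},
}


def fold_total(fold: int) -> int:
    """Total number of examples across train, validation and test for one fold."""
    return sum(SPLIT_EXAMPLES[fold].values())


def split_share(fold: int, split: str) -> float:
    """Fraction of a fold's examples that fall in the given split."""
    return SPLIT_EXAMPLES[fold][split] / fold_total(fold)


if __name__ == "__main__":
    for fold in sorted(SPLIT_EXAMPLES):
        shares = ", ".join(
            f"{name} {split_share(fold, name):.1%}" for name in SPLIT_EXAMPLES[fold]
        )
        print(f"fold{fold}: {fold_total(fold)} examples ({shares})")
```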

Configuration

Default configuration:
- Data file paths:
  - fold1_train: data/fold1_train-*
  - fold1_validation: data/fold1_validation-*
  - fold1_test: data/fold1_test-*
  - fold2_train: data/fold2_train-*
  - fold2_validation: data/fold2_validation-*
  - fold2_test: data/fold2_test-*
  - fold3_train: data/fold3_train-*
  - fold3_validation: data/fold3_validation-*
  - fold3_test: data/fold3_test-*
  - fold4_train: data/fold4_train-*
  - fold4_validation: data/fold4_validation-*
  - fold4_test: data/fold4_test-*
  - fold5_train: data/fold5_train-*
  - fold5_validation: data/fold5_validation-*
  - fold5_test: data/fold5_test-*
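
Since every split is named `fold<N>_<part>`, one fold can be loaded by building the three split names and passing them to the Hugging Face `datasets` library. The following is a sketch under stated assumptions: the `datasets` package is installed, the Hub is reachable, and the helper names `split_names` and `load_fold` are hypothetical. The import is deferred so the name-building helper works without the library.

```python
REPO_ID = "alexbrandsen/archaeo_ner_dutch"
PARTS = ("train", "validation", "test")


def split_names(fold: int) -> dict[str, str]:
    """Map 'train'/'validation'/'test' to the split names used for one fold."""
    if fold not in range(1, 6):
        raise ValueError(f"fold must be between 1 and 5, got {fold}")
    return {part: f"fold{fold}_{part}" for part in PARTS}


def load_fold(fold: int):
    """Load the three splits of one fold from the Hub (needs network access)."""
    # Imported lazily so split_names() is usable without `datasets` installed.
    from datasets import load_dataset

    return {
        part: load_dataset(REPO_ID, split=name)
        for part, name in split_names(fold).items()
    }


if __name__ == "__main__":
    print(split_names(1))
    # fold1 = load_fold(1)          # uncomment to download the data
    # print(fold1["train"][0])      # first example: its tokens and ner_tags
```

Running all five folds then amounts to looping over `range(1, 6)` and training and evaluating once per fold.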
