alexbrandsen/archaeo_ner_dutch
Dutch Archaeology NER Dataset
Overview
- Language: Dutch
- License: Other (Hippocratic License 3.0)
- Task category: Token classification (named entity recognition)
- Dataset name: Dutch Archaeology NER Dataset
Dataset information
- Features:
  - tokens: sequence of strings
  - ner_tags: sequence of class labels, with the following classes:
    - 0: O
    - 1: B-ART
    - 2: I-ART
    - 3: B-CON
    - 4: I-CON
    - 5: B-LOC
    - 6: I-LOC
    - 7: B-MAT
    - 8: I-MAT
    - 9: B-PER
    - 10: I-PER
    - 11: B-SPE
    - 12: I-SPE
- Splits:
  - fold1_train: 22150 examples, 4490700 bytes
  - fold1_validation: 5852 examples, 1579488 bytes
  - fold1_test: 5750 examples, 1574291 bytes
  - fold2_train: 22465 examples, 4685070 bytes
  - fold2_validation: 5431 examples, 1379777 bytes
  - fold2_test: 5865 examples, 1579700 bytes
  - fold3_train: 19560 examples, 4762905 bytes
  - fold3_validation: 8757 examples, 1501653 bytes
  - fold3_test: 5427 examples, 1379769 bytes
  - fold4_train: 17029 examples, 4533412 bytes
  - fold4_validation: 7963 examples, 1609278 bytes
  - fold4_test: 8755 examples, 1501649 bytes
  - fold5_train: 20039 examples, 4460910 bytes
  - fold5_validation: 5747 examples, 1574155 bytes
  - fold5_test: 7965 examples, 1609342 bytes
- Dataset size:
  - Download size: 7478347 bytes
  - Total size: 38222099 bytes
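As a quick consistency check, the per-split byte counts listed above can be summed and compared with the reported total size. The sketch below only copies the figures from this page; the names `SPLIT_BYTES`, `REPORTED_TOTAL`, and `total_bytes` are illustrative and not part of the dataset.

```python
# Byte counts copied from the split list above: (train, validation, test) per fold.
SPLIT_BYTES = {
    "fold1": (4490700, 1579488, 1574291),
    "fold2": (4685070, 1379777, 1579700),
    "fold3": (4762905, 1501653, 1379769),
    "fold4": (4533412, 1609278, 1501649),
    "fold5": (4460910, 1574155, 1609342),
}

# Total size as reported in the dataset information.
REPORTED_TOTAL = 38222099


def total_bytes(split_bytes):
    """Sum the byte counts of every split in every fold."""
    return sum(sum(sizes) for sizes in split_bytes.values())


for fold, sizes in SPLIT_BYTES.items():
    print(fold, sum(sizes))
print("matches reported total:", total_bytes(SPLIT_BYTES) == REPORTED_TOTAL)
```

The fifteen splits add up to exactly the reported total, so the total size appears to count all five folds together rather than a single train/validation/test partition.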
Configuration
- Default configuration:
  - Data file paths:
    - fold1_train: data/fold1_train-*
    - fold1_validation: data/fold1_validation-*
    - fold1_test: data/fold1_test-*
    - fold2_train: data/fold2_train-*
    - fold2_validation: data/fold2_validation-*
    - fold2_test: data/fold2_test-*
    - fold3_train: data/fold3_train-*
    - fold3_validation: data/fold3_validation-*
    - fold3_test: data/fold3_test-*
    - fold4_train: data/fold4_train-*
    - fold4_validation: data/fold4_validation-*
    - fold4_test: data/fold4_test-*
    - fold5_train: data/fold5_train-*
    - fold5_validation: data/fold5_validation-*
    - fold5_test: data/fold5_test-*
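The split names follow a regular `fold<N>_<part>` pattern, so one fold can be loaded with a small helper. The following is a minimal sketch that assumes the identifier `alexbrandsen/archaeo_ner_dutch` is a Hugging Face Hub repository and that the `datasets` library is installed; `split_name` and `load_fold` are hypothetical helper names introduced here for illustration.

```python
FOLDS = range(1, 6)                      # folds 1 through 5, as listed above
PARTS = ("train", "validation", "test")  # the three parts of every fold


def split_name(fold, part):
    """Build a split name such as 'fold1_train' from a fold number and a part."""
    if fold not in FOLDS:
        raise ValueError(f"unknown fold: {fold!r}")
    if part not in PARTS:
        raise ValueError(f"unknown part: {part!r}")
    return f"fold{fold}_{part}"


def load_fold(fold, repo_id="alexbrandsen/archaeo_ner_dutch"):
    """Load the train, validation and test splits of one fold.

    Assumption: repo_id points at a Hugging Face Hub dataset. Requires the
    `datasets` package and network access on first use.
    """
    from datasets import load_dataset  # imported lazily; split_name works without it

    return {
        part: load_dataset(repo_id, split=split_name(fold, part))
        for part in PARTS
    }
```

Under these assumptions, `load_fold(1)["train"]` would return the `fold1_train` split, and looping over `FOLDS` would give a five-fold cross-validation setup.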
Labels
- ART: Artefacts (e.g. bijl, pijlpunt)
- MAT: Materials (e.g. vuursteen, ijzer)
- PER: Time periods (e.g. Middeleeuwen, 400 v. Chr.)
- CON: Contexts (e.g. greppel, beerput)
- LOC: Locations (e.g. Amsterdam, Oss)
- SPE: Species (e.g. Betula nana, koe)
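The integer `ner_tags` use a B-/I- prefix scheme over these six entity types, so consecutive tags can be grouped back into entity spans. The sketch below is one way to do that, not the authors' own tooling: `ID2LABEL` copies the class list from the dataset information, `extract_entities` is a hypothetical helper, and the example sentence is made up from the sample terms above rather than taken from the dataset.

```python
# Index i holds the label for tag id i, copied from the ner_tags class list.
ID2LABEL = [
    "O",
    "B-ART", "I-ART",
    "B-CON", "I-CON",
    "B-LOC", "I-LOC",
    "B-MAT", "I-MAT",
    "B-PER", "I-PER",
    "B-SPE", "I-SPE",
]


def extract_entities(tokens, tag_ids):
    """Group B-/I- tag ids into a list of (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        prefix, _, ent_type = ID2LABEL[tag_id].partition("-")
        if prefix == "I" and ent_type == current_type:
            # Continuation of the span that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # A stray I- tag without a matching B- is treated as a new span.
            current_type, current_tokens = ent_type, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hypothetical sentence, not taken from the dataset.
tokens = ["Een", "bijl", "van", "vuursteen", "uit", "de", "Middeleeuwen", "in", "Oss"]
tag_ids = [0, 1, 0, 7, 0, 0, 9, 0, 5]
print(extract_entities(tokens, tag_ids))
# [('ART', 'bijl'), ('MAT', 'vuursteen'), ('PER', 'Middeleeuwen'), ('LOC', 'Oss')]
```

Joining tokens with a single space is a simplification; the original spacing of the source text is not recoverable from the token list alone.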
Citation
@inproceedings{brandsen-etal-2020-creating,
    title = "Creating a Dataset for Named Entity Recognition in the Archaeology Domain",
    author = "Brandsen, Alex and Verberne, Suzan and Wansleeben, Milco and Lambers, Karsten",
    editor = "Calzolari, Nicoletta and B{\'e}chet, Fr{\'e}d{\'e}ric and Blache, Philippe and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, H{\'e}l{\`e}ne and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.562",
    pages = "4573--4577",
    abstract = "In this paper, we present the development of a training dataset for Dutch Named Entity Recognition (NER) in the archaeology domain. This dataset was created as there is a dire need for semantic search within archaeology, in order to allow archaeologists to find structured information in collections of Dutch excavation reports, currently totalling around 60,000 (658 million words) and growing rapidly. To guide this search task, NER is needed. We created rigorous annotation guidelines in an iterative process, then instructed five archaeology students to annotate a number of documents. The resulting dataset contains {\textasciitilde}31k annotations between six entity types (artefact, time period, place, context, species {\&} material). The inter-annotator agreement is 0.95, and when we used this data for machine learning, we observed an increase in F1 score from 0.51 to 0.70 in comparison to a machine learning model trained on a dataset created in prior work. This indicates that the data is of high quality, and can confidently be used to train NER classifiers.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}



