ehri-ner/ehri-ner-all
EHRI-NER Dataset Overview
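Dataset Summary
EHRI-NER is a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for training domain-specific named entity recognition (NER) models on Holocaust-related texts. It was built by converting all available Extensible Markup Language (XML) files in the EHRI digital scholarly editions (the EHRI Online Editions) into a format suitable for training NER models.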
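Dataset Details
- Total tokens: 505758
- Entity counts:
  - Person entities: 5351
  - Location entities: 9399
  - Organization entities: 1867
  - Date entities: 2237
  - Camp entities: 1229
  - Ghetto entities: 528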
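Dataset Description
Since 2018, the EHRI Consortium has supported the development and publication of six Holocaust-related digital scholarly editions. Each edition provides digital access, through a single web interface, to thematically related documents held by different EHRI partner institutions, and uses digital tools to unlock new ways of presenting and browsing historical sources. These resources have been repurposed and converted into a dataset suitable for training NER models, with the existing annotations treated as a gold standard.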
Annotation Format
Each word is on its own line, and each sentence is followed by a blank line. The annotations follow the conll2003 format (IOB).
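The layout described above (one word per line, a blank line after each sentence) can be read with a few lines of Python. The following is a minimal sketch, not the dataset's official loader. It assumes that columns are separated by whitespace, that the token is the first column and the IOB tag is the last, and the function name `parse_conll` is hypothetical. The sample text is invented for illustration and is not taken from the dataset.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def parse_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences.

    Assumption: whitespace-separated columns, token first, IOB tag last.
    A blank line closes the current sentence.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank line: end of sentence (ignore repeated blank lines).
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    # Flush the last sentence if the input does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


# Invented sample in the assumed two-column layout.
sample = """The O
transport O
left O
Prague B-LOC
for O
Theresienstadt B-GHETTO
. O

It O
arrived O
in O
November B-DATE
1941 I-DATE
. O
"""

sentences = parse_conll(sample.splitlines())
```

If the actual files carry extra columns or document separator lines, the column indices and the blank-line handling would need to be adjusted accordingly.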
Entity Categories
- Person (PER)
- Location (LOC)
- Organization (ORG)
- Date (DATE)
- Camp (CAMP)
- Ghetto (GHETTO)
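For token classification, the six categories above are usually expanded into IOB tags and mapped to integer ids. The sketch below shows one way to do that. It assumes the tags take the form `B-<TYPE>` and `I-<TYPE>` plus `O`, and the helper name `build_label_maps` is hypothetical. The id order is arbitrary and is not guaranteed to match any published model's configuration.

```python
from typing import Dict, List, Tuple

# Entity type abbreviations as listed in the dataset card.
ENTITY_TYPES: List[str] = ["PER", "LOC", "ORG", "DATE", "CAMP", "GHETTO"]


def build_label_maps(
    entity_types: List[str],
) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Build label<->id maps for IOB tagging.

    Assumption: one "O" tag plus a B-/I- pair per entity type.
    """
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps(ENTITY_TYPES)
```

Six entity types yield 13 labels in total (one `O` tag and two tags per type).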
Dataset Source
The dataset is derived from the EHRI Online Editions, a series of six Holocaust-related digital scholarly editions.
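Usage Limitations
The dataset is derived from a series of manually annotated digital scholarly editions whose original purpose was not to provide training data for NER models. Although we consider these resources to be of high quality and suitable for this purpose, users should bear in mind that the dataset is a repurposed resource and was not built specifically for it.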
Citation
BibTeX:
@inproceedings{dermentzi_repurposing_2024,
  address = {Torino, Italy},
  title = {Repurposing {Holocaust}-{Related} {Digital} {Scholarly} {Editions} to {Develop} {Multilingual} {Domain}-{Specific} {Named} {Entity} {Recognition} {Tools}},
  url = {https://hal.science/hal-04547222},
  abstract = {The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it, and making it more discoverable. With this paper, we release EHRI-NER, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for Named Entity Recognition (NER) in Holocaust-related texts. EHRI-NER is built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. We leverage this dataset to fine-tune the multilingual Transformer-based language model XLM-RoBERTa (XLM-R) to determine whether a single model can be trained to recognize entities across different document types and languages. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5\%. We argue that this score is sufficiently high to consider the next steps towards deploying this model.},
  urldate = {2024-04-29},
  booktitle = {{LREC}-{COLING} 2024 - {Joint} {International} {Conference} on {Computational} {Linguistics}, {Language} {Resources} and {Evaluation}},
  publisher = {ELRA Language Resources Association (ELRA); International Committee on Computational Linguistics (ICCL)},
  author = {Dermentzi, Maria and Scheithauer, Hugo},
  month = may,
  year = {2024},
  keywords = {Digital Editions, Holocaust Testimonies, Multilingual, Named Entity Recognition, Transfer Learning, Transformers},
}
APA: Dermentzi, M., & Scheithauer, H. (2024, May). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. HTRes@LREC-COLING 2024, Torino, Italy. https://hal.science/hal-04547222



