ehri-ner/ehri-ner-all

Name: ehri-ner/ehri-ner-all
Creator: ehri-ner
Published: 2024-04-29 14:10:11
License: not specified

Source: Hugging Face. Updated: 2024-04-29. Indexed: 2024-06-11.

Download link:

https://hf-mirror.com/datasets/ehri-ner/ehri-ner-all

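Resource overview:

The EHRI-NER dataset is a multilingual dataset for training named entity recognition (NER) models on Holocaust-related texts. It contains text in Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, and Yiddish. The dataset is derived from the EHRI Online Editions, a series of six Holocaust-related digital scholarly editions developed and published by the EHRI Consortium. The entity types are person, location, organization, date, camp, and ghetto. Each word is on its own line, sentences are separated by a blank line, and annotations follow the conll2003 format (IOB).

Provider:

ehri-ner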

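Summary of Original Information

EHRI-NER Dataset Overview

Dataset Introduction

EHRI-NER is a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for training domain-specific named entity recognition (NER) models on Holocaust-related texts. It was built by converting all available Extensible Markup Language (XML) files from the EHRI digital scholarly editions (the EHRI Online Editions) into a format suitable for training NER models.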

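Dataset Details

Total tokens: 505758
Entity counts:
- Person entities: 5351
- Location entities: 9399
- Organization entities: 1867
- Date entities: 2237
- Camp entities: 1229
- Ghetto entities: 528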
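
The figures above can be summarized with a little arithmetic. The sketch below sums the per-type counts, computes each type's share of all annotated entities, and derives a rough density of entities per 1,000 tokens. The dictionary keys reuse the tag names listed on this card (PER, LOC, ORG, DATE, CAMP, GHETTO); the variable names are illustrative, and the density is only an approximate indicator because the counts are entities, not tokens.

```python
# Figures copied from the dataset card.
TOTAL_TOKENS = 505758
ENTITY_COUNTS = {
    "PER": 5351,
    "LOC": 9399,
    "ORG": 1867,
    "DATE": 2237,
    "CAMP": 1229,
    "GHETTO": 528,
}

# Total number of annotated entities across all six types.
total_entities = sum(ENTITY_COUNTS.values())

# Fraction of all entities that belongs to each type.
shares = {label: count / total_entities for label, count in ENTITY_COUNTS.items()}

# Rough annotation density: entities per 1,000 tokens.
entities_per_1000_tokens = 1000 * total_entities / TOTAL_TOKENS

# Print the types from most to least frequent.
for label, count in sorted(ENTITY_COUNTS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<7}{count:>6}  {shares[label]:6.1%}")
print(f"total  {total_entities:>6}")
print(f"entities per 1,000 tokens: {entities_per_1000_tokens:.1f}")
```

Locations are the largest class and ghettos the smallest, so per-class metrics are likely to be more informative than a single aggregate score when evaluating a model on this data.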

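Dataset Description

Since 2018, the EHRI Consortium has supported the development and publication of six Holocaust-related digital scholarly editions. Each edition provides digital access, through a single web interface, to thematically related documents held by different EHRI partner institutions, and uses digital tools to open up new ways of presenting and browsing historical sources. These resources have been repurposed and converted into a dataset suitable for training NER models, and their annotations are treated as a gold standard.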

Annotation Format

Each word is on its own line, and each sentence is followed by a blank line. Annotations follow the conll2003 format (IOB).
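
As an illustration of this layout, here is a minimal sketch of a reader for one-token-per-line files with blank lines between sentences. It assumes that fields are separated by whitespace, that the token is the first field, and that the IOB tag is the last field; the card does not state the exact column layout, so check it against the actual files. The function name `parse_conll` and the sample text are invented for this example and are not taken from the dataset.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def parse_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group one-token-per-line input into sentences split on blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("-DOCSTART-"):
            # Document marker used in some CoNLL-2003 style files.
            continue
        fields = line.split()
        # Assumption: token is the first field, tag is the last field.
        token = fields[0]
        tag = fields[-1] if len(fields) > 1 else "O"
        current.append((token, tag))
    if current:
        # Keep the last sentence when the input lacks a trailing blank line.
        sentences.append(current)
    return sentences


# Invented sample in the assumed layout (not real dataset content).
sample = """\
Eva B-PER
Klein I-PER
left O
Prague B-LOC
in O
1942 B-DATE
. O

She O
wrote O
to O
the O
council O
. O
"""

sentences = parse_conll(sample.splitlines())
for sentence in sentences:
    print(sentence)
```

To read a file instead of the inline sample, pass an open text file handle to `parse_conll`, since iterating a file yields its lines.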

Entity Types

Person (PER)
Location (LOC)
Organization (ORG)
Date (DATE)
Camp (CAMP)
Ghetto (GHETTO)
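
To show how these types appear in IOB annotation, the sketch below groups a tagged token sequence into entity spans. It assumes tags of the form `B-PER`, `I-PER`, and `O`, which is the usual convention but is not spelled out on this card. Because IOB has two common variants, the decoder is tolerant: an `I-` tag that does not continue an entity of the same type simply starts a new one. The function name `extract_entities` and the sample sentence are invented for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Tag names as listed on the dataset card.
ENTITY_TYPES = ("PER", "LOC", "ORG", "DATE", "CAMP", "GHETTO")


def extract_entities(tokens: Sequence[str], tags: Sequence[str]) -> List[Tuple[str, str]]:
    """Return (entity_type, text) pairs decoded from parallel token and tag lists."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        if current_type is not None:
            # Any other tag closes the open entity.
            entities.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
        if prefix in ("B", "I") and entity_type:
            # Start a new entity (a stray I- tag is treated like B-).
            current_tokens, current_type = [token], entity_type

    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Invented sample (not real dataset content).
tokens = ["Eva", "Klein", "left", "Prague", "in", "1942", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-DATE", "O"]

entities = extract_entities(tokens, tags)
print(entities)
```

Joining tokens with a single space is a simplification; the original spacing is not recoverable from the token column alone.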

Dataset Source

The dataset is derived from the EHRI Online Editions, a series of six Holocaust-related digital scholarly editions.

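Limitations

The dataset is derived from a series of manually annotated digital scholarly editions that were not originally intended to provide training data for NER models. Although we consider these resources to be of high quality and suitable for this purpose, users should keep in mind that the dataset is a repurposed resource that was not built specifically for this use.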

Citation

BibTeX:
@inproceedings{dermentzi_repurposing_2024,
  address = {Torino, Italy},
  title = {Repurposing {Holocaust}-{Related} {Digital} {Scholarly} {Editions} to {Develop} {Multilingual} {Domain}-{Specific} {Named} {Entity} {Recognition} {Tools}},
  url = {https://hal.science/hal-04547222},
  abstract = {The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it, and making it more discoverable. With this paper, we release EHRI-NER, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for Named Entity Recognition (NER) in Holocaust-related texts. EHRI-NER is built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. We leverage this dataset to fine-tune the multilingual Transformer-based language model XLM-RoBERTa (XLM-R) to determine whether a single model can be trained to recognize entities across different document types and languages. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5\%. We argue that this score is sufficiently high to consider the next steps towards deploying this model.},
  urldate = {2024-04-29},
  booktitle = {{LREC}-{COLING} 2024 - {Joint} {International} {Conference} on {Computational} {Linguistics}, {Language} {Resources} and {Evaluation}},
  publisher = {ELRA Language Resources Association (ELRA); International Committee on Computational Linguistics (ICCL)},
  author = {Dermentzi, Maria and Scheithauer, Hugo},
  month = may,
  year = {2024},
  keywords = {Digital Editions, Holocaust Testimonies, Multilingual, Named Entity Recognition, Transfer Learning, Transformers},
}

APA: Dermentzi, M., & Scheithauer, H. (2024, May). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. HTRes@LREC-COLING 2024, Torino, Italy. https://hal.science/hal-04547222
