lfcc/ner_archive_pt

Name: lfcc/ner_archive_pt
Creator: lfcc
Published: 2023-11-23 16:43:28
License: no description available

Source: Hugging Face. Updated 2023-11-23; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/lfcc/ner_archive_pt

Resource overview:

---
task_categories:
- token-classification
language:
- pt
size_categories:
- 100K<n<1M
---

### Dataset

This dataset was created by consolidating information from various Portuguese Archives. We gathered data from these archives and subsequently performed manual annotation of each harvested corpus with Named Entities such as Person, Place, Date, Profession and Organization. The resulting dataset was formed by merging all the individual corpora into a unified corpus which we named "ner-archive-pt" and can be accessed at: http://ner.epl.di.uminho.pt/

### Citation

```bibtex
@Article{make4010003,
  AUTHOR = {Cunha, Luís Filipe and Ramalho, José Carlos},
  TITLE = {NER in Archival Finding Aids: Extended},
  JOURNAL = {Machine Learning and Knowledge Extraction},
  VOLUME = {4},
  YEAR = {2022},
  NUMBER = {1},
  PAGES = {42--65},
  URL = {https://www.mdpi.com/2504-4990/4/1/3},
  ISSN = {2504-4990},
  ABSTRACT = {The amount of information preserved in Portuguese archives has increased over the years. These documents represent a national heritage of high importance, as they portray the country’s history. Currently, most Portuguese archives have made their finding aids available to the public in digital format, however, these data do not have any annotation, so it is not always easy to analyze their content. In this work, Named Entity Recognition solutions were created that allow the identification and classification of several named entities from the archival finding aids. These named entities translate into crucial information about their context and, with high confidence results, they can be used for several purposes, for example, the creation of smart browsing tools by using entity linking and record linking techniques. In order to achieve high result scores, we annotated several corpora to train our own Machine Learning algorithms in this context domain. We also used different architectures, such as CNNs, LSTMs, and Maximum Entropy models. Finally, all the created datasets and ML models were made available to the public with a developed web platform, NER@DI.},
  DOI = {10.3390/make4010003}
}
```

Provider:

lfcc

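Original information summary

Dataset overview

Task category: token classification (token-classification)
Language: Portuguese (pt)
Size: 100K<n<1M

Dataset description

The dataset was created by consolidating information from several Portuguese archives. Data was collected from these archives, and each harvested corpus was manually annotated with named entities: Person, Place, Date, Profession, and Organization. The final dataset was formed by merging all the individual corpora into one unified corpus, named "ner-archive-pt".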


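Collected summary

Dataset introduction

Construction

In archival science and the digital humanities, Portuguese archival materials hold a wealth of historical information, yet their digital finding aids lack structured annotation. To address this, the ner_archive_pt dataset was built by consolidating finding aids from several Portuguese archives. The research team collected text from each archival source and then manually annotated every corpus with named entities covering Person, Place, Date, Profession, and Organization. The individual corpora were then merged into a single corpus, yielding a structured dataset in the 100K < n < 1M size category that serves as a foundation for entity recognition research on Portuguese archival text.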

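Usage

Users can load the dataset directly from the Hugging Face platform to train and evaluate Portuguese named entity recognition models. In a natural language processing pipeline, the annotated ner_tags sequences support sequence labeling, including experiments with architectures such as convolutional neural networks and long short-term memory networks. The dataset can also feed downstream applications such as entity linking and record linking, supporting the development of smart browsing and content analysis tools for archival material in digital humanities research.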
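As a minimal sketch of working with the ner_tags sequences, the snippet below groups tagged tokens into entity spans. It assumes the tags are BIO-style strings such as `B-Person` and `I-Person`; the actual tag names, the `tokens` column name, and whether tags are stored as strings or integer IDs are assumptions not confirmed by the dataset card, and integer IDs would first need mapping to label strings. The helper names and the sample sentence are invented for illustration. The loader is wrapped in a function because it needs network access and the `datasets` package.

```python
from typing import List, Optional, Tuple


def load_archive_dataset():
    """Load the corpus from the Hugging Face Hub (needs network access)."""
    from datasets import load_dataset  # pip install datasets
    return load_dataset("lfcc/ner_archive_pt")


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A new entity starts here (a stray I- tag is treated leniently as a start).
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" tag: close any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Invented example sentence with hypothetical tag names.
sample_tokens = ["José", "Carlos", "nasceu", "em", "Braga"]
sample_tags = ["B-Person", "I-Person", "O", "O", "B-Place"]
print(extract_entities(sample_tokens, sample_tags))
# [('José Carlos', 'Person'), ('Braga', 'Place')]
```

Working on label strings rather than integer IDs keeps the helper independent of how the dataset happens to encode its labels.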

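Background and challenges

Background overview

Named entity recognition for Portuguese archival documents is a valuable research topic at the intersection of the digital humanities and archival science. The ner_archive_pt dataset, described in a 2022 article by Luís Filipe Cunha and José Carlos Ramalho, supports the automatic extraction of key entities (Person, Place, Date, Profession, and Organization) from the finding aids of Portuguese archives. The dataset consolidates corpora from multiple archival sources into a single manually annotated resource, providing a structured basis for the analysis of historical documents and for knowledge mining in cultural heritage digitization.

Current challenges

The dataset targets domain-specific difficulties of named entity recognition in archival documents, including variation in archaic vocabulary, ambiguous entity boundaries, and linking entities across documents. Building it involved heterogeneous archival text formats, the need to keep annotations consistent, and the handling of domain-specific terminology, which calls for repeated manual verification and model iteration to improve data quality and reliability.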

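Common scenarios

Classic use cases

In the digital processing of Portuguese archival documents, the ner_archive_pt dataset supports named entity recognition. It consolidates finding-aid text from several Portuguese archives and is manually annotated with Person, Place, Date, Profession, and Organization entities. Researchers typically use it to train and evaluate sequence labeling models, such as LSTM-based or Transformer-based architectures, that automatically identify structured information in archival documents and so support content parsing and knowledge extraction.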
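Evaluating a sequence labeling model is commonly done at the entity level, where a prediction counts only if both the span boundaries and the entity type match the gold annotation. The sketch below computes micro-averaged precision, recall, and F1 under that exact-match rule. It assumes BIO-style string tags with hypothetical label names, and it is not necessarily the metric reported by the dataset authors.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_spans(tags: List[str]) -> Set[Span]:
    """Convert one BIO tag sequence into a set of typed spans."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == etype
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if prefix == "B" or (prefix == "I" and not continues):
            start, etype = i, label
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged exact-match precision, recall, and F1 over sentences."""
    gold_spans, pred_spans = set(), set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        # Prefix each span with the sentence index so spans stay distinct.
        gold_spans |= {(idx, *span) for span in bio_spans(g)}
        pred_spans |= {(idx, *span) for span in bio_spans(p)}
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: the model finds the person but misses the place.
gold = [["B-Person", "I-Person", "O", "B-Place"]]
pred = [["B-Person", "I-Person", "O", "O"]]
precision, recall, f1 = entity_f1(gold, pred)
print(precision, recall, round(f1, 3))  # 1.0 0.5 0.667
```

Exact-match scoring is strict: a prediction that covers only part of a multi-token name counts as both a false positive and a false negative.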

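Academic problems addressed

The dataset addresses the lack of annotation in digitized archival documents. By providing manually annotated Portuguese named entity data, it supports research on domain-adapted named entity recognition and helps where general-purpose models perform poorly on archival text. It thereby contributes to the technology behind digital preservation of cultural heritage, offers a data foundation for the semantic analysis and knowledge organization of historical documents, and supports research that combines the digital humanities with natural language processing.

Practical applications

In practice, the ner_archive_pt dataset can be used to build smart archive browsing systems. Models trained on it can automatically extract key entities from archival finding aids, which in turn supports entity linking and record linking, enabling related-record retrieval and visual navigation of archival content. This improves the accessibility and use of archival holdings and gives historians, archivists, and the public an efficient way to explore them, supporting the digital preservation and dissemination of cultural heritage.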
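To illustrate how extracted entities could drive related-record retrieval, the sketch below builds an inverted index from entities to record identifiers and looks up records that share an entity. It is a deliberately naive stand-in for record linking: matching is by case-insensitive exact string and entity type, with no disambiguation. The record IDs, entity values, and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Entity = Tuple[str, str]  # (entity text, entity type)


def build_entity_index(records: Dict[str, List[Entity]]) -> Dict[Entity, Set[str]]:
    """Map each normalized entity to the set of record IDs that mention it."""
    index: Dict[Entity, Set[str]] = defaultdict(set)
    for record_id, entities in records.items():
        for text, etype in entities:
            index[(text.casefold(), etype)].add(record_id)
    return index


def related_records(record_id: str,
                    records: Dict[str, List[Entity]],
                    index: Dict[Entity, Set[str]]) -> Set[str]:
    """Return the IDs of other records sharing at least one entity."""
    related: Set[str] = set()
    for text, etype in records[record_id]:
        related |= index.get((text.casefold(), etype), set())
    related.discard(record_id)
    return related


# Invented records; in practice the entities would come from an NER model.
records = {
    "r1": [("Braga", "Place"), ("José Carlos", "Person")],
    "r2": [("braga", "Place")],
    "r3": [("Lisboa", "Place")],
}
index = build_entity_index(records)
print(related_records("r1", records, index))  # {'r2'}
```

A production system would likely replace the exact-match key with normalized or disambiguated entity identifiers, since the same name can refer to different people or places.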

Recent research on the dataset