lfcc/ner_archive_pt
Dataset Overview
- Task category: token classification (token-classification), specifically named entity recognition
- Language: Portuguese (pt)
- Size: 100K<n<1M
Dataset Description
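This dataset was created by consolidating information from several Portuguese archives. We collected data from these archives and manually annotated each resulting corpus with five named-entity types: Person, Place, Date, Profession, and Organization. The final dataset was formed by merging all of the individual corpora into a single unified corpus, which we named "ner-archive-pt".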
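The merge step described above can be illustrated with a minimal sketch. It assumes each per-archive corpus is a list of (tokens, labels) pairs using BIO tags; the label strings (`PERSON`, `PLACE`, `DATE`, `PROFESSION`, `ORGANIZATION`), the archive names, and the helper functions are hypothetical and may not match the tag set or tooling actually used for ner-archive-pt.

```python
from collections import Counter

# Hypothetical label names for the five entity types; the real tag set may differ.
ENTITY_TYPES = {"PERSON", "PLACE", "DATE", "PROFESSION", "ORGANIZATION"}


def merge_corpora(corpora):
    """Concatenate per-archive corpora into one list, recording each sentence's source."""
    merged = []
    for source, sentences in corpora.items():
        for tokens, labels in sentences:
            # Every token needs exactly one label.
            if len(tokens) != len(labels):
                raise ValueError(f"token/label length mismatch in {source}")
            # Reject labels outside the assumed BIO tag set.
            for label in labels:
                if label != "O" and label[2:] not in ENTITY_TYPES:
                    raise ValueError(f"unknown label {label!r} in {source}")
            merged.append({"source": source, "tokens": tokens, "labels": labels})
    return merged


def count_entities(merged):
    """Count entity mentions per type by counting B- tags."""
    counts = Counter()
    for sentence in merged:
        for label in sentence["labels"]:
            if label.startswith("B-"):
                counts[label[2:]] += 1
    return counts


# Toy data standing in for two separately annotated archive corpora.
corpora = {
    "archive_a": [
        (["João", "Silva", "nasceu", "em", "Braga"],
         ["B-PERSON", "I-PERSON", "O", "O", "B-PLACE"]),
    ],
    "archive_b": [
        (["Carta", "de", "1820"], ["O", "O", "B-DATE"]),
    ],
}

unified = merge_corpora(corpora)
entity_counts = count_entities(unified)
```

Keeping a `source` field on each sentence is a design choice of this sketch: it lets the unified corpus still be split or evaluated per archive.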
Citation Information
@Article{make4010003,
  AUTHOR = {Cunha, Luís Filipe and Ramalho, José Carlos},
  TITLE = {NER in Archival Finding Aids: Extended},
  JOURNAL = {Machine Learning and Knowledge Extraction},
  VOLUME = {4},
  YEAR = {2022},
  NUMBER = {1},
  PAGES = {42--65},
  URL = {https://www.mdpi.com/2504-4990/4/1/3},
  ISSN = {2504-4990},
  ABSTRACT = {The amount of information preserved in Portuguese archives has increased over the years. These documents represent a national heritage of high importance, as they portray the country’s history. Currently, most Portuguese archives have made their finding aids available to the public in digital format, however, these data do not have any annotation, so it is not always easy to analyze their content. In this work, Named Entity Recognition solutions were created that allow the identification and classification of several named entities from the archival finding aids. These named entities translate into crucial information about their context and, with high confidence results, they can be used for several purposes, for example, the creation of smart browsing tools by using entity linking and record linking techniques. In order to achieve high result scores, we annotated several corpora to train our own Machine Learning algorithms in this context domain. We also used different architectures, such as CNNs, LSTMs, and Maximum Entropy models. Finally, all the created datasets and ML models were made available to the public with a developed web platform, NER@DI.},
  DOI = {10.3390/make4010003}
}




