Datasets and Models for Ontology-aligned Information Extraction in Portuguese Cultural Heritage
Source: Figshare | Updated: 2026-03-09 | Indexed: 2026-04-28
Download link:
https://figshare.com/articles/dataset/Datasets_and_Models_for_Ontology-aligned_Information_Extraction_in_Portuguese_Cultural_Heritage/28970633
Description:
Ontology-aligned Information Extraction in Portuguese Cultural Heritage

This repository contains datasets and fine-tuned models for ontology-aligned Named Entity Recognition (NER) and Relation Extraction (RE) in Portuguese cultural heritage archival documents. The datasets include annotations of entities and relations mapped to the classes and properties of ArchOnto, an ontology designed for archives.

Named Entity Recognition

We fine-tuned four BiLSTM-CRF model variants on a general-domain Portuguese dataset. These models, along with GLiNER, a transformer-based model, were evaluated on general-domain and domain-specific archival datasets.

NER Datasets:
ner/datasets/ner-domain-gen/train: General-domain Portuguese dataset with annotated entities mapped to ArchOnto classes, for NER model fine-tuning;
ner/datasets/ner-domain-gen/test: General-domain Portuguese dataset with annotated entities mapped to ArchOnto classes, for NER evaluation;
ner/datasets/ner-spec-human/: Domain-specific human-transcribed texts from 20th-century Portuguese archival documents, annotated with ArchOnto classes;
ner/datasets/ner-spec-ocr/: Domain-specific OCR-extracted texts from 20th-century Portuguese archival documents, annotated with ArchOnto classes.

NER Models:
ner/models/flairel: BiLSTM-CRF model with FlairEL embeddings;
ner/models/flairel+word2vec: BiLSTM-CRF model with FlairEL and Skip-gram Word2Vec embeddings;
ner/models/flairbbp: BiLSTM-CRF model with FlairBBP embeddings;
ner/models/flairbbp+word2vec: BiLSTM-CRF model with FlairBBP and Skip-gram Word2Vec embeddings.

Embeddings:
FlairEL embeddings are available at https://github.com/ericlief/language_models
FlairBBP embeddings are available at https://github.com/jneto04/ner-pt
Skip-gram Word2Vec embeddings are available at http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc

Relation Extraction

We evaluated the GLiREL model on general-domain and domain-specific archival Portuguese datasets.

RE Datasets:
re/datasets/ner-domain-gen/test: General-domain Portuguese dataset with annotated entities and relations mapped to ArchOnto classes and properties, for RE evaluation;
re/datasets/ner-spec-human/: Domain-specific human-transcribed texts from 20th-century Portuguese archival documents, annotated with ArchOnto classes and properties;
re/datasets/ner-spec-ocr/: Domain-specific OCR-extracted texts from 20th-century Portuguese archival documents, annotated with ArchOnto classes and properties.
Created:
2026-03-09
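
After downloading, it can be useful to confirm that every resource named in the description is present locally. The following is a minimal sketch, not part of the original repository: it assumes the Figshare files have been extracted under a single local root directory with the relative paths kept exactly as listed in the description. The function names `check_layout` and `missing_resources` are hypothetical helpers introduced here for illustration. Nothing is assumed about file formats or about file names inside each directory.

```python
from pathlib import Path

# Relative paths copied from the description (trailing slashes dropped).
NER_DATASETS = [
    "ner/datasets/ner-domain-gen/train",
    "ner/datasets/ner-domain-gen/test",
    "ner/datasets/ner-spec-human",
    "ner/datasets/ner-spec-ocr",
]
NER_MODELS = [
    "ner/models/flairel",
    "ner/models/flairel+word2vec",
    "ner/models/flairbbp",
    "ner/models/flairbbp+word2vec",
]
RE_DATASETS = [
    "re/datasets/ner-domain-gen/test",
    "re/datasets/ner-spec-human",
    "re/datasets/ner-spec-ocr",
]


def check_layout(root):
    """Map each expected relative path to True if it exists under root.

    Assumption: `root` is the local directory where the download was
    extracted. A path counts as present whether it is a file or a directory.
    """
    root = Path(root)
    expected = NER_DATASETS + NER_MODELS + RE_DATASETS
    return {rel: (root / rel).exists() for rel in expected}


def missing_resources(root):
    """Return the sorted list of expected paths that are absent under root."""
    return sorted(rel for rel, present in check_layout(root).items() if not present)
```

Calling `missing_resources("path/to/extracted/download")` returns an empty list when all eleven listed resources are in place, and otherwise names the ones that still need to be downloaded or moved.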



