B-NER
Source: DataCite Commons | Updated: 2023-02-23 | Indexed: 2025-04-16
Download link:
https://ieee-dataport.org/documents/b-ner
Description:
In Natural Language Processing (NLP), Named Entity Recognition (NER) is regarded as the basis for extracting the key information needed to understand texts in any language. Because Bangla is a highly inflectional, morphologically rich, and resource-scarce language, building a balanced NER corpus with large and diverse entities is a demanding task. Previously developed Bangla NER systems are limited to recognizing only three familiar entities: person, location, and organization. To address this limitation, we introduce B-NER, a novel Bangla NER dataset built from 22,144 manually annotated Bangla sentences collected from Bangla newspapers and Bangla Wikipedia. The dataset contains 9,895 unique words that were manually categorized into eight entity types: person, organization, event, artifact, time indicator, natural phenomenon, geopolitical entity, and geographical location. Inter-annotator agreement experiments were conducted to validate the quality of the annotations produced by three annotators, yielding a Kappa score of 0.82. In this paper, we outline the annotation guideline with examples, discuss the properties of the B-NER dataset, and present benchmark evaluations of the dataset. To demonstrate the superiority and balance of B-NER relative to other publicly available datasets, we conducted a cross-dataset analysis in which the model was trained on B-NER and tested on publicly accessible datasets. The results showed that the model trained on B-NER performed best. Furthermore, we performed exhaustive benchmark evaluations using a Bidirectional LSTM with fastText embeddings and sentence transformer models. Among these models, fine-tuned IndicBERT achieved notable results, with a macro accuracy of 86%.
This dataset and baseline results will be publicly available under a CC-BY 4.0 license in the CoNLL-2002 format to facilitate further research on Bangla NER.
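Since the description states that the data is distributed in the CoNLL-2002 format, a minimal Python sketch of reading such a file may help. It assumes the usual CoNLL-2002 layout (one token per line with its tag in the last whitespace-separated column, and a blank line between sentences) and BIO-style tags. The tag names `PER`, `GPE`, and `NAT`, the sample tokens, and the helper names `read_conll` and `extract_entities` are hypothetical illustrations, not taken from the B-NER release.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # First column is the token, last column is the NER tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) spans from BIO tags."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)
        else:
            # An "O" tag (or anything else) closes any open span.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


# Hypothetical sample; real B-NER tokens are Bangla and tag names may differ.
SAMPLE = """\
Rahim B-PER
Uddin I-PER
visited O
Dhaka B-GPE
. O

Cyclone B-NAT
Sidr I-NAT
struck O
"""

sentences = read_conll(SAMPLE.splitlines())
for sent in sentences:
    print(extract_entities(sent))
```

To use this on a downloaded file, pass an open file handle (opened with `encoding="utf-8"`) to `read_conll` in place of `SAMPLE.splitlines()`; the actual column layout and tag inventory should be checked against the released files.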
Provider:
IEEE DataPort
Created:
2023-02-23



