
B-NER

Source: DataCite Commons | Updated: 2023-02-23 | Indexed: 2025-04-16
Download link:
https://ieee-dataport.org/documents/b-ner
Resource description:
Within Natural Language Processing (NLP), Named Entity Recognition (NER) is regarded as the basis for extracting the key information needed to understand text in any language. Because Bangla is a highly inflectional, morphologically rich, and resource-scarce language, building a balanced NER corpus with a large and diverse set of entities is a demanding task. Previously developed Bangla NER systems, however, are limited to recognizing only three familiar entity types: person, location, and organization. To address this limitation, we introduce B-NER, a novel Bangla NER dataset built from 22,144 manually annotated Bangla sentences collected from Bangla newspapers and Bangla Wikipedia. The dataset contains 9,895 unique words, manually categorized into eight entity types: person, organization, event, artifact, time indicator, natural phenomenon, geopolitical entity, and geographical location. Inter-annotator agreement experiments were conducted to validate the quality of the annotations performed by three annotators, yielding a Kappa score of 0.82. In this paper, we outline the annotation guideline with examples, discuss the properties of the B-NER dataset, and present benchmark evaluations of the dataset. To demonstrate the superiority and balance of B-NER compared with other publicly available datasets, we conducted a cross-dataset analysis in which the model was trained on B-NER and tested on publicly accessible datasets. The results showed that the model trained on B-NER performed optimally. Furthermore, we performed exhaustive benchmark evaluations using Bidirectional LSTM with fastText embeddings and sentence transformer models. Among these models, fine-tuned IndicBERT achieved notable results, with a macro accuracy of 86%.
The dataset and baseline results will be made publicly available under a CC-BY 4.0 license in the CoNLL-2002 format to facilitate further research on Bangla NER.
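Since the description says the data is distributed in the CoNLL-2002 format, the following is a minimal sketch of how such a file could be read. It assumes the usual convention of one token per line with the entity tag in the last whitespace-separated column, blank lines between sentences, and B-/I-/O tag prefixes. The function names (read_conll, extract_spans), the tag labels PER and LOC, and the sample sentence are all hypothetical and are not taken from the B-NER release; the actual column layout and label names should be checked against the downloaded files.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, tag) pairs.

    Assumption: the token is the first column and the tag is the last one.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # A blank line (or a document marker) ends the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_spans(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) pairs from B-/I-/O tags."""
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        prefix, _, kind = tag.partition("-")
        # An I- tag of the same type continues the open entity.
        if prefix == "I" and kind == label and tokens:
            tokens.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if tokens:
            spans.append((" ".join(tokens), label))
            tokens, label = [], None
        # A B- tag (or a stray I- tag, treated leniently) opens a new entity.
        if prefix in ("B", "I") and kind:
            tokens, label = [token], kind
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


# Hypothetical sample in the assumed layout; not real B-NER data.
sample = """\
Rahim B-PER
Uddin I-PER
visited O
Dhaka B-LOC
. O

Rain O
fell O
"""

sentences = read_conll(sample.splitlines())
for sent in sentences:
    print(extract_spans(sent))
```

For a real file, the same functions could be applied to an open file handle, for example read_conll(open(path, encoding="utf-8")), where path is wherever the downloaded file was saved. UTF-8 is assumed here because the text is Bangla.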
Provider:
IEEE DataPort
Created:
2023-02-23