kaiserahmed/BanglaTag-NER-Dataset
Source: Hugging Face. Updated 2026-04-29; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/kaiserahmed/BanglaTag-NER-Dataset
Description:
The BanglaTag NER Dataset is a Bengali (Bangla) named entity recognition dataset of 22,144 sentences annotated with the BIO tagging scheme across 13 entity types. The data is stored as JSON Lines: each line is a JSON object with tokens and ner_tags fields. The entity types are PER (person names), LOC (locations, cities, countries), ORG (organizations, companies), POL (political entities and parties), DATE (calendar dates), TIME (times of day), EVENT (named events), CRIME (crime-related entities), TITLE (titles, designations), NUM (numbers, quantities), SYMBOL (symbols, currencies), CONSTITUENCY (electoral constituencies), and INST (institutions). The data is split 85% / 7.5% / 7.5%: about 18,822 sentences for training and about 1,661 each for validation and test. The README also provides the label list, usage examples (loading with Hugging Face's load_dataset function or reading the JSONL files manually), a reference to an NER model fine-tuned on this dataset, and citation information.
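The record format and tagging scheme described above can be sketched in Python. This is a minimal illustration and not the README's own usage example. It assumes that ner_tags holds string labels such as "B-PER" (the listing does not say whether tags are strings or integer IDs), that the label list is "O" plus a B-/I- pair per entity type, and it uses one invented sample sentence that is not taken from the dataset. The helper names build_bio_labels, read_jsonl, and extract_spans are hypothetical.

```python
import io
import json

# The 13 entity types named in the description.
ENTITY_TYPES = [
    "PER", "LOC", "ORG", "POL", "DATE", "TIME", "EVENT",
    "CRIME", "TITLE", "NUM", "SYMBOL", "CONSTITUENCY", "INST",
]


def build_bio_labels(entity_types):
    """Return 'O' plus a B-/I- pair per type (the ordering is an assumption)."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels


def read_jsonl(handle):
    """Parse one JSON object per line and check tokens/ner_tags alignment."""
    records = []
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if len(record["tokens"]) != len(record["ner_tags"]):
            raise ValueError(f"line {line_no}: tokens and ner_tags differ in length")
        records.append(record)
    return records


def extract_spans(tokens, tags):
    """Group string BIO tags into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            if current_type:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O" closes any open span
            if current_type:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


labels = build_bio_labels(ENTITY_TYPES)

# Invented sample record, for illustration only (not from the dataset).
sample_tokens = ["ঢাকা", "বিশ্ববিদ্যালয়", "আজ", "বন্ধ"]
sample_tags = ["B-INST", "I-INST", "O", "O"]
sample_line = json.dumps(
    {"tokens": sample_tokens, "ner_tags": sample_tags}, ensure_ascii=False
)

records = read_jsonl(io.StringIO(sample_line + "\n"))
spans = extract_spans(records[0]["tokens"], records[0]["ner_tags"])

# Split sizes implied by the stated 85% / 7.5% / 7.5% proportions.
TOTAL_SENTENCES = 22_144
train_size = round(TOTAL_SENTENCES * 0.85)
eval_size = round(TOTAL_SENTENCES * 0.075)

print(len(labels), spans, train_size, eval_size)
```

To read a real file, pass an open file handle (opened with encoding="utf-8") to read_jsonl in place of the in-memory sample; the actual file names in the repository are not given here. Loading through the Hugging Face datasets library would presumably be load_dataset("kaiserahmed/BanglaTag-NER-Dataset"), though the split names and the tag encoding it returns should be checked against the README.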
Provider:
kaiserahmed



