BNDSNER: A Large-scale Medical NER Corpus in Bangla

NIAID Data Ecosystem, indexed 2026-05-10

Download link:

https://data.mendeley.com/datasets/g7x8fxdpxj


Description:

This is a large-scale corpus for medical named entity recognition (NER) in Bangla, containing almost 25k sentences and more than 100k tokens. The raw text was collected from a variety of sources by Python web scraping with Beautiful Soup (https://pypi.org/project/beautifulsoup4/). The sentences were then annotated with disease (D) and symptom (S) entities using the IOB2 scheme. If you need only disease and symptom entities, you can use the corpus as is. For other entity types such as chemicals, drugs, treatments, or anatomy, you can extend the corpus with further annotation.
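As a rough illustration of how IOB2 labels group tokens into entities, the following Python sketch converts a tag sequence into entity spans. The tag strings `B-D`, `I-D`, `B-S`, `I-S`, and `O` are an assumption based on the D and S labels mentioned above; the corpus files may spell them differently. The function name `iob2_to_spans` and the sample tokens are hypothetical and do not come from the dataset.

```python
from typing import List, Tuple

# (entity type, start token index, end token index exclusive)
Span = Tuple[str, int, int]


def iob2_to_spans(tags: List[str]) -> List[Span]:
    """Group an IOB2 tag sequence into entity spans.

    Assumes tags look like "B-D", "I-D", "B-S", "I-S", "O" (an assumption,
    not confirmed by the dataset description).
    """
    spans: List[Span] = []
    etype = None   # type of the entity currently open, if any
    start = None   # index where the open entity began
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = tag[2:], i
        elif tag.startswith("I-") and etype == tag[2:]:
            # Continuation of the open entity.
            continue
        else:
            # "O", or an I- tag that does not continue the open entity.
            # Such a stray I- tag is invalid in IOB2 and is ignored here.
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = None, None
    if etype is not None:
        spans.append((etype, start, len(tags)))
    return spans


# Hypothetical English placeholder sentence, not taken from the corpus.
tokens = ["patient", "has", "high", "fever", "from", "dengue"]
tags = ["O", "O", "B-S", "I-S", "O", "B-D"]
for etype, begin, end in iob2_to_spans(tags):
    print(etype, " ".join(tokens[begin:end]))
```

Working with spans rather than raw tags makes it easier to count entities per type or to check the annotation for stray `I-` tags before training a model.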


Created:

2025-10-21
