uznlp-uz/uz_medner
Source: Hugging Face · Updated: 2026-03-19 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/uznlp-uz/uz_medner
Resource description:
---
license: cc-by-4.0
language:
- uz
pretty_name: UZ-MedNER v1.0
size_categories:
- 1K<n<10K
task_categories:
- text-classification
configs:
- config_name: default
  default: true
  data_files:
  - split: train
    path: UzMedNER.tsv
  sep: "\t"
- config_name: tagset
  data_files:
  - split: train
    path: tagset.tsv
  sep: "\t"
---
# Uzbek Medical NER Dataset (UzMedNER)
## 📌 Description
This dataset introduces **UzMedNER**, a structured Named Entity Recognition (NER) resource for the Uzbek language in the **medical domain**. It is designed to support token-level sequence labeling tasks and facilitate research in low-resource biomedical NLP.
The dataset consists of manually annotated Uzbek text where each token is labeled using a predefined tagset representing medical and related entity types.
UzMedNER addresses the lack of:
* domain-specific annotated corpora in Uzbek
* standardized NER benchmarks for medical text
* resources for training sequence labeling models in low-resource settings
---
## 🧠 Task Definition
This dataset is designed for:
### Named Entity Recognition (NER)
* **Input:** tokenized Uzbek sentence
* **Output:** sequence of entity labels (BIO tagging scheme)
Example:
```text
Bemor O diabet B-DISEASE bilan O kasallangan O . O
```
---
## 📊 Dataset Structure
The dataset is stored in **TSV format** with token-level annotations.
Typical format:
```tsv
token label
Bemor O
diabet B-DISEASE
bilan O
kasallangan O
```
* Each row = one token
* Labels follow **BIO tagging scheme**
* Sentences are separated by empty lines
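
The layout described above can be read with a few lines of plain Python. The following is a minimal sketch, assuming a tab separator, an optional `token`/`label` header row, and blank lines between sentences. The function name `read_conll_tsv` and the sample text are hypothetical and not part of the dataset.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, label) pairs


def read_conll_tsv(text: str, sep: str = "\t") -> List[Sentence]:
    """Parse token/label rows into sentences split on blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split(sep)
        if len(parts) != 2:
            raise ValueError(f"expected 2 columns, got {len(parts)}: {raw!r}")
        token, label = parts
        is_first_row = not sentences and not current
        if is_first_row and (token.lower(), label.lower()) == ("token", "label"):
            continue  # skip the optional header row (assumed naming)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample built from the rows shown in this card.
sample = (
    "token\tlabel\n"
    "Bemor\tO\ndiabet\tB-DISEASE\nbilan\tO\nkasallangan\tO\n"
    "\n"
    "Bemor\tO\nyurak\tB-ANATOMY\n"
)
sentences = read_conll_tsv(sample)
```

Grouping rows into sentences at load time keeps later steps (splitting, evaluation) at the sentence level rather than the token level.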
---
## 🏷 Tagset (Entity Types)
The dataset uses a BIO-based tagging scheme with the following entity categories:
| Tag | Description |
| ------------------------- | -------------------------- |
| B-DISEASE / I-DISEASE | Disease names |
| B-SYMPTOM / I-SYMPTOM | Symptoms |
| B-DRUG / I-DRUG | Medications |
| B-TREATMENT / I-TREATMENT | Medical treatments |
| B-ANATOMY / I-ANATOMY | Body parts |
| B-TEST / I-TEST | Medical tests |
| O | Outside (non-entity token) |
> Note: Exact tag inventory is defined in the accompanying `tagset.tsv` file.
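
Token classification models usually need the tags mapped to integer IDs. The sketch below builds such a mapping, assuming the six entity types listed in the table above; the authoritative inventory is `tagset.tsv`, and the helper name `build_label_maps` is hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: entity types copied from the table in this card, not from tagset.tsv.
ENTITY_TYPES = ["DISEASE", "SYMPTOM", "DRUG", "TREATMENT", "ANATOMY", "TEST"]


def build_label_maps(entity_types: List[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Return (label2id, id2label) for a BIO scheme, with "O" mapped to 0."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.extend([f"B-{entity_type}", f"I-{entity_type}"])
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps(ENTITY_TYPES)
```

With six entity types this yields 13 labels: one `O` tag plus a `B-`/`I-` pair per type.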
---
## 🧾 Example
```text
Token Label
Bemor O
yurak B-ANATOMY
og‘rig‘i B-SYMPTOM
bilan O
shifoxonaga O
murojaat O
qildi O
```
---
## 📏 Evaluation Protocol
Recommended evaluation metrics:
* Precision
* Recall
* F1-score (entity-level)
* Token-level accuracy
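Evaluation should follow the standard **CoNLL NER evaluation** protocol, in which a predicted entity counts as correct only if both its span and its type exactly match the gold annotation.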
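
Entity-level scoring can be sketched as follows. This is a minimal illustration, not an official scorer: it assumes exact-match scoring on (start, end, type) spans and treats a stray `I-` tag as the start of a new entity. The function names `bio_to_spans` and `entity_prf` and the toy label sequences are hypothetical. In practice a maintained library such as `seqeval` is commonly used for this purpose.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(labels: List[str]) -> Set[Span]:
    """Convert one BIO label sequence into a set of typed spans."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    etype: Optional[str] = None
    for i, label in enumerate(labels + ["O"]):  # trailing "O" flushes the last span
        prefix, _, tag = label.partition("-")
        continues = prefix == "I" and tag == etype
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, etype = i, tag
    return spans


def entity_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged entity-level precision, recall, and F1 (exact match)."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_spans, p_spans = bio_to_spans(g), bio_to_spans(p)
        tp += len(g_spans & p_spans)
        n_gold += len(g_spans)
        n_pred += len(p_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the second prediction truncates a two-token DISEASE entity,
# so it does not count as a match.
gold = [["O", "B-ANATOMY", "B-SYMPTOM", "O"], ["O", "B-DISEASE", "I-DISEASE"]]
pred = [["O", "B-ANATOMY", "O", "O"], ["O", "B-DISEASE", "O"]]
precision, recall, f1 = entity_prf(gold, pred)
```

Token-level accuracy would give partial credit to the truncated entity, which is why entity-level F1 is the stricter and more informative metric here.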
---
## 📊 Data Splits
*Note: the current release ships only a `train` split; predefined splits may be added in future versions.*
Recommended split:
* Train: 80%
* Validation: 10%
* Test: 10%
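
One way to produce the recommended 80/10/10 split reproducibly is to shuffle at the sentence level with a fixed seed. The sketch below assumes the data has already been grouped into sentences; the function name `split_80_10_10` and the seed value are hypothetical choices, not part of the dataset.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_80_10_10(items: Sequence[T], seed: int = 42) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle with a fixed seed, then cut into train/validation/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10  # integer arithmetic avoids float rounding
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Stand-in for a list of sentences: 50 hypothetical sentence IDs.
sentence_ids = list(range(50))
train, val, test = split_80_10_10(sentence_ids)
```

Splitting whole sentences rather than individual token rows keeps every entity span inside a single partition.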
---
## 🎯 Use Cases
This dataset can be used for:
* 🏥 Medical NER in Uzbek
* 🤖 Fine-tuning transformer models (BERT, RoBERTa, Qwen, etc.)
* 📊 Sequence labeling research
* 🔍 Clinical text mining
* 🧠 Biomedical NLP for low-resource languages
---
## ⚙️ Loading the Dataset
```python
from datasets import load_dataset

# Repo ID as given in the original card; this page lists the dataset as "uznlp-uz/uz_medner".
dataset = load_dataset("ruhilloalaev/UzMedNER", "default")
```
---
## ⚠️ Notes
* Data is in **Uzbek (Latin script)**
* Annotation follows **BIO scheme**
* Domain: **medical / clinical language**
* Some entities may exhibit:
  * morphological variation
  * spelling inconsistencies
  * domain-specific abbreviations
---
## 📜 License
This dataset is released under the **CC-BY-4.0 License**.
Provider:
uznlp-uz



