uznlp-uz/uz_edbench
Source: Hugging Face. Updated 2026-03-19; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/uznlp-uz/uz_edbench
Resource description:
---
license: cc-by-4.0
language:
- uz
pretty_name: UZ-EDBench v1.0
size_categories:
- 1K<n<10K
task_categories:
- text-classification
configs:
- config_name: default
  default: true
  data_files:
  - split: train
    path: UZ-EDBench.tsv
  sep: "\t"
- config_name: doctors
  data_files:
  - split: train
    path: UZ-EDBench.Doctors.tsv
  sep: "\t"
---
# Uzbek Medical Entity Benchmark (UZ-EDBench)
## 📌 Description
**UZ-EDBench** is a structured Uzbek-language dataset designed for **medical entity recognition and classification** in a low-resource setting. The dataset is distributed in **TSV format** and consists of annotated tokens and domain-specific entity labels.
It includes:
* a main annotated corpus (`UZ-EDBench.tsv`)
* a structured list of medical specialists (`UZ-EDBench.Doctors.tsv`)
This dataset addresses the lack of:
* Uzbek medical NLP benchmarks
* annotated corpora for clinical entity extraction
* structured taxonomies of medical specialists
---
## 🧠 Task Definition
The dataset supports:
### 1. Named Entity Recognition (NER)
* **Input:** tokenized Uzbek text
* **Output:** entity labels (BIO format)
### 2. Entity Classification
* **Input:** token / span
* **Output:** entity type (medical category)
---
## 📊 Dataset Structure
### 🔹 Main File: `UZ-EDBench.tsv`
* Format: **TSV (tab-separated)**
* Each row = one token
* Sentences may be separated by empty lines
Typical format:
```tsv
token label
Bemor O
kardiolog B-DOCTOR_TYPE
qabuliga O
keldi O
```
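A minimal parsing sketch for this layout, assuming a `token<TAB>label` header row, exactly two columns per row, and blank lines between sentences. The function name `read_token_tsv` and the in-memory sample are illustrative only, and the real file may differ in these details:

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # one (token, label) pair per row


def read_token_tsv(lines: Iterable[str], has_header: bool = True) -> List[Sentence]:
    """Group (token, label) rows into sentences, splitting on blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    rows = iter(lines)
    if has_header:
        next(rows, None)  # assumption: first row is the column header
    for raw in rows:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.split("\t", 1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample built from the rows shown above, plus a second sentence.
sample = (
    "token\tlabel\n"
    "Bemor\tO\n"
    "kardiolog\tB-DOCTOR_TYPE\n"
    "qabuliga\tO\n"
    "keldi\tO\n"
    "\n"
    "Bemor\tO\n"
)
sentences = read_token_tsv(sample.splitlines())
```

To read the actual file, pass an open file handle instead of `sample.splitlines()`, for example `open("UZ-EDBench.tsv", encoding="utf-8")`.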
---
### 🔹 Auxiliary File: `UZ-EDBench.Doctors.tsv`
This file contains structured information about **medical specialists (doctor types)** used in annotation.
Typical structure:
```tsv
doctor_type description
kardiolog Yurak kasalliklari bo‘yicha mutaxassis
nevrolog Asab tizimi mutaxassisi
```
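As a sketch, the table can be turned into a lookup from specialist name to description. The column names `doctor_type` and `description` are taken from the typical structure above and are an assumption about the real file; the sample is in-memory and illustrative:

```python
import csv
import io

# Illustrative sample mirroring the typical structure shown above.
sample = (
    "doctor_type\tdescription\n"
    "kardiolog\tYurak kasalliklari bo‘yicha mutaxassis\n"
    "nevrolog\tAsab tizimi mutaxassisi\n"
)

# DictReader uses the header row as keys; delimiter="\t" selects TSV parsing.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
doctor_descriptions = {row["doctor_type"]: row["description"] for row in reader}
```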
---
## 🏷 Tagset (Entity Labels)
The dataset uses a **domain-specific BIO tagging scheme**.
### 🔹 Core Medical Entities
| Tag                       | Description                             |
| ------------------------- | --------------------------------------- |
| B-DISEASE / I-DISEASE     | Disease names (kasallik nomlari)        |
| B-SYMPTOM / I-SYMPTOM     | Signs and symptoms (belgilar)           |
| B-DRUG / I-DRUG           | Medications (dori vositalari)           |
| B-TREATMENT / I-TREATMENT | Treatment methods (davolash usullari)   |
| B-TEST / I-TEST           | Medical tests (tibbiy tekshiruvlar)     |
| B-ANATOMY / I-ANATOMY     | Body parts (tana qismlari)              |
---
### 🔹 Doctor Types (Shifokor turlari)
| Tag | Description |
| ----------------------------- | -------------------- |
| B-DOCTOR_TYPE / I-DOCTOR_TYPE | Medical specialty (tibbiy mutaxassislik) |
Examples include:
* kardiolog
* terapevt
* nevrolog
* pediatr
* jarroh
* dermatolog
The full list is provided in:
👉 `UZ-EDBench.Doctors.tsv`
---
### 🔹 BIO Tagging Scheme
| Tag | Meaning |
| ----- | ------------------- |
| B-XXX | Beginning of entity |
| I-XXX | Inside entity |
| O | Outside entity |
Example:
```text
kardiolog B-DOCTOR_TYPE
shifokor I-DOCTOR_TYPE
```
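The BIO scheme can be decoded into entity spans with a single pass over a sentence's labels. The sketch below is one possible decoder, not part of the dataset. The name `bio_to_spans` is hypothetical, and treating a stray `I-` tag (one with no matching `B-` before it) as the start of a new span is a lenient choice made here:

```python
from typing import List, Optional, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def bio_to_spans(labels: List[str]) -> List[Span]:
    """Convert one sentence's BIO labels into typed, half-open spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    current_type: Optional[str] = None
    # The trailing "O" is a sentinel that flushes a span ending at the last token.
    for i, label in enumerate(labels + ["O"]):
        if label.startswith("I-") and label[2:] == current_type:
            continue  # still inside the current entity
        if current_type is not None:
            spans.append((current_type, start, i))
            start, current_type = None, None
        if label != "O":
            # B-XXX opens a span; a stray I-XXX is also treated as an opening tag.
            start, current_type = i, label[2:]
    return spans


spans = bio_to_spans(["B-DOCTOR_TYPE", "I-DOCTOR_TYPE"])
```

For the two-token example above, `spans` holds a single `DOCTOR_TYPE` span covering both tokens.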
---
## 🧾 Example
```text
Token Label
Bemor O
nevrolog B-DOCTOR_TYPE
qabuliga O
bosh B-ANATOMY
og‘rig‘i B-SYMPTOM
bilan O
keldi O
```
---
## 📏 Evaluation Protocol
Recommended metrics:
* Precision
* Recall
* F1-score (entity-level)
* Token-level accuracy
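Evaluation should follow the **CoNLL NER standard**: a predicted entity counts as correct only if both its span boundaries and its type exactly match the gold annotation.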
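A minimal sketch of entity-level scoring under exact matching. It assumes entities are already decoded into `(type, start, end)` tuples per sentence. The function name `entity_prf` and the sample spans are illustrative, and an established scorer may be preferable for reported results:

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def entity_prf(gold: List[Set[Span]], pred: List[Set[Span]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over exactly matching spans."""
    # A true positive requires identical type, start and end.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# One sentence: the doctor type is found, but the symptom span has wrong boundaries.
gold = [{("DOCTOR_TYPE", 1, 2), ("ANATOMY", 3, 4), ("SYMPTOM", 4, 5)}]
pred = [{("DOCTOR_TYPE", 1, 2), ("SYMPTOM", 3, 5)}]
precision, recall, f1 = entity_prf(gold, pred)
```

Here one of two predictions is correct and one of three gold entities is recovered, so a boundary error is penalised in both precision and recall.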
---
## 📊 Data Splits
*No predefined validation or test splits are included; each config ships a single `train` split.*
Recommended split:
* Train: 80%
* Validation: 10%
* Test: 10%
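One way to produce such a split reproducibly is to shuffle with a fixed seed and slice. This is a sketch only: the helper name `split_80_10_10` and the seed value are arbitrary choices. Splitting at the sentence level rather than the token level is an assumption that keeps each sentence's tokens in one partition:

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_80_10_10(items: Sequence[T], seed: int = 42) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle with a fixed seed, then slice into train / validation / test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding; the remainder goes to test.
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Stand-in for a list of sentences; integers are used only for illustration.
train, val, test = split_80_10_10(range(100))
```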
---
## 🎯 Use Cases
* 🏥 Uzbek medical NER systems
* 🤖 Fine-tuning transformer models (BERT, RoBERTa, Qwen, etc.)
* 📊 Clinical text mining
* 🧠 Healthcare AI assistants
* 🔍 Information extraction from Uzbek medical text
---
# UZ-EDBench
This repository contains two tab-separated subsets:
- `default`: the main triage benchmark in `UZ-EDBench.tsv`
- `doctors`: the doctor label reference table in `UZ-EDBench.Doctors.tsv`
## ⚙️ Loading the Dataset
```python
from datasets import load_dataset

# Repository ID as shown in this page's title and download link.
main = load_dataset("uznlp-uz/uz_edbench", name="default")
doctors = load_dataset("uznlp-uz/uz_edbench", name="doctors")
```
---
## ⚠️ Notes
* Data is in **Uzbek (Latin script)**
* Format: **TSV (tab-separated)**
* Domain: **medical / healthcare**
Text may include:
* morphological variation
* domain-specific terminology
* spelling inconsistencies
---
## 📜 License
This dataset is released under the **CC-BY-4.0 License**.
Provider:
uznlp-uz



