uznlp-uz/uz_edbench

Name: uznlp-uz/uz_edbench
Creator: uznlp-uz
Published: 2026-03-19 12:24:08
License: CC-BY-4.0

Hugging Face | Updated: 2026-03-19 | Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/uznlp-uz/uz_edbench


Description:

---
license: cc-by-4.0
language:
- uz
pretty_name: UZ-EDBench v1.0
size_categories:
- 1K<n<10K
task_categories:
- text-classification
configs:
- config_name: default
  default: true
  data_files:
  - split: train
    path: UZ-EDBench.tsv
  sep: "\t"
- config_name: doctors
  data_files:
  - split: train
    path: UZ-EDBench.Doctors.tsv
  sep: "\t"
---

# Uzbek Medical Entity Benchmark (UZ-EDBench)

## 📌 Description

**UZ-EDBench** is a structured Uzbek-language dataset designed for **medical entity recognition and classification** in a low-resource setting.

The dataset is distributed in **TSV format** and consists of annotated tokens and domain-specific entity labels. It includes:

* a main annotated corpus (`UZ-EDBench.tsv`)
* a structured list of medical specialists (`UZ-EDBench.Doctors.tsv`)

This dataset addresses the lack of:

* Uzbek medical NLP benchmarks
* annotated corpora for clinical entity extraction
* structured taxonomies of medical specialists

---

## 🧠 Task Definition

The dataset supports:

### 1. Named Entity Recognition (NER)

* **Input:** tokenized Uzbek text
* **Output:** entity labels (BIO format)

### 2. Entity Classification

* **Input:** token / span
* **Output:** entity type (medical category)

---

## 📊 Dataset Structure

### 🔹 Main File: `UZ-EDBench.tsv`

* Format: **TSV (tab-separated)**
* Each row = one token
* Sentence boundaries may be separated by empty lines

Typical format:

```tsv
token	label
Bemor	O
kardiolog	B-DOCTOR_TYPE
qabuliga	O
keldi	O
```

---

### 🔹 Auxiliary File: `UZ-EDBench.Doctors.tsv`

This file contains structured information about **medical specialists (doctor types)** used in annotation.

Typical structure:

```tsv
doctor_type	description
kardiolog	Yurak kasalliklari bo‘yicha mutaxassis
nevrolog	Asab tizimi mutaxassisi
```

---

## 🏷 Tagset (Entity Labels)

The dataset uses a **domain-specific BIO tagging scheme**.
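To make the token-per-row layout of `UZ-EDBench.tsv` and its BIO labels concrete, here is a minimal sketch of a reader that groups rows into sentences. It assumes a `token`/`label` header row and blank lines between sentences, as in the typical format shown earlier; the actual file may differ. The name `read_sentences` is illustrative and not part of any dataset tooling.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, label) pairs
HEADER = ("token", "label")       # assumed header row; adjust if the file differs


def read_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Group token-per-row TSV lines into sentences split on blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    seen_first_row = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip malformed rows that lack a label column
        if not seen_first_row:
            seen_first_row = True
            if tuple(fields[:2]) == HEADER:
                continue  # skip the header row
        current.append((fields[0], fields[1]))
    if current:
        sentences.append(current)
    return sentences


# Small in-memory sample that mirrors the documented layout.
sample = (
    "token\tlabel\n"
    "Bemor\tO\n"
    "kardiolog\tB-DOCTOR_TYPE\n"
    "qabuliga\tO\n"
    "keldi\tO\n"
    "\n"
    "Bemor\tO\n"
    "keldi\tO\n"
)
sentences = read_sentences(sample.splitlines())
print(len(sentences), sentences[0])
```

With a real file, the same function could be called on an open file handle, for example `read_sentences(open("UZ-EDBench.tsv", encoding="utf-8"))`.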

### 🔹 Core Medical Entities

| Tag                       | Description                               |
| ------------------------- | ----------------------------------------- |
| B-DISEASE / I-DISEASE     | Disease names (Kasallik nomlari)          |
| B-SYMPTOM / I-SYMPTOM     | Symptoms (Belgilar, simptomlar)           |
| B-DRUG / I-DRUG           | Medications (Dori vositalari)             |
| B-TREATMENT / I-TREATMENT | Treatment methods (Davolash usullari)     |
| B-TEST / I-TEST           | Medical examinations (Tibbiy tekshiruvlar) |
| B-ANATOMY / I-ANATOMY     | Body parts (Tana qismlari)                |

---

### 🔹 Doctor Types (Shifokor turlari)

| Tag                           | Description                              |
| ----------------------------- | ---------------------------------------- |
| B-DOCTOR_TYPE / I-DOCTOR_TYPE | Medical specialty (Tibbiy mutaxassislik) |

Examples include:

* kardiolog
* terapevt
* nevrolog
* pediatr
* jarroh
* dermatolog

The full list is provided in:

👉 `UZ-EDBench.Doctors.tsv`

---

### 🔹 BIO Tagging Scheme

| Tag   | Meaning             |
| ----- | ------------------- |
| B-XXX | Beginning of entity |
| I-XXX | Inside entity       |
| O     | Outside entity      |

Example:

```text
kardiolog	B-DOCTOR_TYPE
shifokor	I-DOCTOR_TYPE
```

---

## 🧾 Example

```text
Token	Label
Bemor	O
nevrolog	B-DOCTOR_TYPE
qabuliga	O
bosh	B-ANATOMY
og‘rig‘i	B-SYMPTOM
bilan	O
keldi	O
```

---

## 📏 Evaluation Protocol

Recommended metrics:

* Precision
* Recall
* F1-score (entity-level)
* Token-level accuracy

Evaluation should follow the **CoNLL NER standard**.

---

## 📊 Data Splits

*Predefined splits are not included.*

Recommended split:

* Train: 80%
* Validation: 10%
* Test: 10%

---

## 🎯 Use Cases

* 🏥 Uzbek medical NER systems
* 🤖 Fine-tuning transformer models (BERT, RoBERTa, Qwen, etc.)
* 📊 Clinical text mining
* 🧠 Healthcare AI assistants
* 🔍 Information extraction from Uzbek medical text

---

# UZ-EDBench

This repository contains two tab-separated subsets:

- `default`: the main triage benchmark in `UZ-EDBench.tsv`
- `doctors`: the doctor label reference table in `UZ-EDBench.Doctors.tsv`

## ⚙️ Loading the Dataset

```python
from datasets import load_dataset

main = load_dataset("ruhilloalaev/uz_edbench", name="default")
doctors = load_dataset("ruhilloalaev/uz_edbench", name="doctors")
```

---

## ⚠️ Notes

* Data is in **Uzbek (Latin script)**
* Format: **TSV (tab-separated)**
* Domain: **medical / healthcare**

Text may include:

* morphological variation
* domain-specific terminology
* spelling inconsistencies

---

## 📜 License

This dataset is released under the **CC-BY-4.0 License**.
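
As a rough illustration of the entity-level metrics recommended in the Evaluation Protocol, the sketch below converts BIO label sequences into typed spans and scores exact span matches. It is not an official scorer for this dataset: the helper names `bio_to_spans` and `entity_prf` are hypothetical, and one simplifying assumption differs from some CoNLL scripts, namely that an `I-` tag with no matching preceding `B-` tag is ignored rather than treated as the start of an entity.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def bio_to_spans(labels: List[str]) -> Set[Span]:
    """Extract typed entity spans from one sentence's BIO labels."""
    spans: Set[Span] = set()
    start, etype = None, None
    # The trailing "O" is a sentinel that closes a span ending at the last token.
    for i, label in enumerate(labels + ["O"]):
        continues = label.startswith("I-") and etype == label[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
        # Assumption: a stray I- tag (no open span of that type) is ignored.
    return spans


def entity_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over exact span matches."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(g), bio_to_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Gold labels follow the documented example sentence; the prediction misses ANATOMY.
gold = [["O", "B-DOCTOR_TYPE", "O", "B-ANATOMY", "B-SYMPTOM", "O", "O"]]
pred = [["O", "B-DOCTOR_TYPE", "O", "O", "B-SYMPTOM", "O", "O"]]
precision, recall, f1 = entity_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

For reported results, an established CoNLL-style scorer is likely a better choice than this sketch, since the handling of malformed tag sequences affects the numbers.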

Provider:

uznlp-uz
