
uznlp-uz/uz_medner

Source: Hugging Face. Updated 2026-03-19. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/uznlp-uz/uz_medner
Resource description:
---
license: cc-by-4.0
language:
- uz
pretty_name: UZ-MedNER v1.0
size_categories:
- 1K<n<10K
task_categories:
- text-classification
configs:
- config_name: default
  default: true
  data_files:
  - split: train
    path: UzMedNER.tsv
  sep: "\t"
- config_name: tagset
  data_files:
  - split: train
    path: tagset.tsv
  sep: "\t"
---

# Uzbek Medical NER Dataset (UzMedNER)

## 📌 Description

**UzMedNER** is a structured Named Entity Recognition (NER) resource for the Uzbek language in the **medical domain**. It is designed to support token-level sequence labeling and to facilitate research in low-resource biomedical NLP.

The dataset consists of manually annotated Uzbek text in which each token is labeled using a predefined tagset of medical and related entity types.

UzMedNER addresses the lack of:

* domain-specific annotated corpora in Uzbek
* standardized NER benchmarks for medical text
* resources for training sequence labeling models in low-resource settings

---

## 🧠 Task Definition

This dataset is designed for:

### Named Entity Recognition (NER)

* **Input:** tokenized Uzbek sentence
* **Output:** sequence of entity labels (BIO tagging scheme)

Example:

```text
Bemor        O
diabet       B-DISEASE
bilan        O
kasallangan  O
.            O
```

---

## 📊 Dataset Structure

The dataset is stored in **TSV format** with token-level annotations.
Typical format:

```tsv
token	label
Bemor	O
diabet	B-DISEASE
bilan	O
kasallangan	O
```

* Each row = one token
* Labels follow **BIO tagging scheme**
* Sentences are separated by empty lines

---

## 🏷 Tagset (Entity Types)

The dataset uses a BIO-based tagging scheme with the following entity categories:

| Tag                       | Description                |
| ------------------------- | -------------------------- |
| B-DISEASE / I-DISEASE     | Disease names              |
| B-SYMPTOM / I-SYMPTOM     | Symptoms                   |
| B-DRUG / I-DRUG           | Medications                |
| B-TREATMENT / I-TREATMENT | Medical treatments         |
| B-ANATOMY / I-ANATOMY     | Body parts                 |
| B-TEST / I-TEST           | Medical tests              |
| O                         | Outside (non-entity token) |

> Note: Exact tag inventory is defined in the accompanying `tagset.tsv` file.

---

## 🧾 Example

```text
Token        Label
Bemor        O
yurak        B-ANATOMY
og‘rig‘i     B-SYMPTOM
bilan        O
shifoxonaga  O
murojaat     O
qildi        O
```

---

## 📏 Evaluation Protocol

Recommended evaluation metrics:

* Precision
* Recall
* F1-score (entity-level)
* Token-level accuracy

Evaluation should follow standard **CoNLL NER evaluation**.

---

## 📊 Data Splits

*Note: predefined splits may be added in future versions.*

Recommended split:

* Train: 80%
* Validation: 10%
* Test: 10%

---

## 🎯 Use Cases

This dataset can be used for:

* 🏥 Medical NER in Uzbek
* 🤖 Fine-tuning transformer models (BERT, RoBERTa, Qwen, etc.)
* 📊 Sequence labeling research
* 🔍 Clinical text mining
* 🧠 Biomedical NLP for low-resource languages

---

## ⚙️ Loading the Dataset

```python
from datasets import load_dataset

dataset = load_dataset("ruhilloalaev/UzMedNER", "default")
```

---

## ⚠️ Notes

* Data is in **Uzbek (Latin script)**
* Annotation follows **BIO scheme**
* Domain: **medical / clinical language**
* Some entities may exhibit:
  * morphological variation
  * spelling inconsistencies
  * domain-specific abbreviations

---

## 📜 License

This dataset is released under the **CC-BY-4.0 License**.
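
---

## 🧩 Parsing and Scoring Sketch

The token-per-row layout, the BIO labels, the recommended 80/10/10 split, and entity-level scoring described above can be tied together in a few lines of plain Python. The following is a minimal sketch under stated assumptions, not the dataset's official tooling: the function names (`parse_bio_tsv`, `bio_to_spans`, `split_sentences`, `entity_prf`) are hypothetical, the file is assumed to contain `token<TAB>label` rows with blank lines between sentences and an optional header row, and the scoring is a simplified exact-match count rather than the reference CoNLL evaluation script.

```python
import random
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # [(token, label), ...]


def parse_bio_tsv(text: str) -> List[Sentence]:
    """Group token<TAB>label rows into sentences, splitting at blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        token, _, label = line.partition("\t")
        label = label.strip()
        # Assumption: a leading "token<TAB>label" header row may be present.
        is_first_row = not sentences and not current
        if is_first_row and (token.lower(), label.lower()) == ("token", "label"):
            continue
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


def bio_to_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO labels to (entity_type, start, end_exclusive) spans."""
    spans, start, kind = [], None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes the last span
        prefix, _, etype = label.partition("-")
        continues = prefix == "I" and etype == kind
        if kind is not None and not continues:
            spans.append((kind, start, i))
            start, kind = None, None
        # Lenient choice: an I- tag with no open span starts a new one.
        if prefix == "B" or (prefix == "I" and kind is None):
            start, kind = i, etype
    return spans


def split_sentences(sentences, seed=13, ratios=(0.8, 0.1, 0.1)):
    """Shuffle whole sentences, then cut into train/validation/test."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def entity_prf(gold: List[List[str]], pred: List[List[str]]):
    """Micro-averaged exact-match precision, recall, and F1 over spans."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_spans, p_spans = set(bio_to_spans(g)), set(bio_to_spans(p))
        tp += len(g_spans & p_spans)
        n_gold += len(g_spans)
        n_pred += len(p_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    sample = "token\tlabel\nBemor\tO\nyurak\tB-ANATOMY\nog‘rig‘i\tB-SYMPTOM\nbilan\tO\n"
    for sentence in parse_bio_tsv(sample):
        labels = [label for _, label in sentence]
        print(bio_to_spans(labels))
```

Splitting at the sentence level rather than the token level keeps every entity span inside a single partition, which is why `split_sentences` shuffles whole sentences. For reported results, an established CoNLL-style scorer is the safer choice; the `entity_prf` helper here is only meant to make the entity-level metric concrete.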
Provider:
uznlp-uz