UzThemeLex Dataset: An Uzbek Thematic Lexicon for Domain Terminology and Weakly Supervised NER
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/jgfkz6493p
Description:
UzThemeLex is a curated Uzbek-language thematic lexicon dataset designed for domain terminology mining and weakly supervised named entity recognition (NER). The release contains 4,945 unique terminological entries organized into 3 top-level domains (Agronomy, Economics and Business, Law and Governance) and 30 subcategories. Each entry provides the Uzbek term in Latin script, a normalized form for matching, a paraphrased Uzbek definition, domain and subcategory labels, provenance pointers to authoritative sources, and lightweight quality-control signals (heuristic confidence score, review flag, ambiguity flag). Optional fields include aliases and example sentences.
The dataset is distributed in multiple formats to support both manual inspection and machine processing. It includes a flat CSV file and a multi-sheet Excel workbook, together with a data dictionary that documents all columns and label sets. For training and pipeline integration, the release also provides JSON/JSONL exports, taxonomy metadata, and ready-to-use pattern files for dictionary-based tagging and weak supervision (e.g., spaCy EntityRuler patterns). A validation script is included to help users verify schema consistency and detect formatting issues (e.g., residual Cyrillic characters and inconsistent apostrophe normalization).
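The two formatting checks named above can be illustrated with a minimal Python sketch. This is not the validation script shipped with the release. The function names, the issue labels, the set of apostrophe variants, and the choice of the ASCII apostrophe as the normalization target are all assumptions made for illustration; the dataset's data dictionary should be consulted for the actual convention.

```python
import re

# Basic Cyrillic Unicode block; any hit in a Latin-script term is suspicious.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

# Apostrophe-like characters that often appear in Uzbek Latin text
# (assumed list, not taken from the dataset documentation).
APOSTROPHE_VARIANTS = "\u2018\u2019\u02BB\u02BC\u0060\u00B4"


def normalize_apostrophes(text, target="'"):
    """Map every apostrophe variant to a single target character."""
    return text.translate({ord(ch): target for ch in APOSTROPHE_VARIANTS})


def find_issues(term):
    """Return a list of hypothetical issue labels for one term string."""
    issues = []
    if CYRILLIC.search(term):
        issues.append("residual_cyrillic")
    if normalize_apostrophes(term) != term:
        issues.append("non_normalized_apostrophe")
    return issues


if __name__ == "__main__":
    for term in ["soliq", "s\u043eliq", "g\u02BBalla"]:
        print(repr(term), find_issues(term))
```

The second sample term contains a Cyrillic "о" (U+043E) that is visually identical to Latin "o", which is the kind of residue a character-range check catches and a visual inspection does not.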
UzThemeLex can be used as (i) a domain dictionary for keyword-based classification and information extraction in Uzbek texts and (ii) a gazetteer for generating weak labels to train or fine-tune NER models. The resource is intended to support Uzbek NLP research and applied text analytics in agriculture, economics, and legal/governance domains.
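As a rough illustration of use case (ii), the sketch below produces BIO-style weak labels by longest-match lookup of token sequences in a gazetteer. It is written in plain Python and does not reproduce the release's pattern files. The function name, the `max_len` parameter, the label names `ECON` and `AGRO`, and the sample terms are hypothetical and were not taken from the dataset.

```python
def weak_label(tokens, gazetteer, max_len=4):
    """Tag tokens with BIO labels using longest-match gazetteer lookup.

    gazetteer maps a tuple of lowercased tokens to a label string.
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(lowered[i:i + n]))
            if label:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


if __name__ == "__main__":
    # Hypothetical entries and labels, for illustration only.
    gazetteer = {
        ("qo'shilgan", "qiymat", "solig'i"): "ECON",
        ("sug'orish",): "AGRO",
    }
    tokens = ["Qo'shilgan", "qiymat", "solig'i", "oshdi"]
    print(list(zip(tokens, weak_label(tokens, gazetteer))))
```

Exact matching of this kind misses inflected forms, which matters for an agglutinative language such as Uzbek; the normalized form and alias fields described above are the natural place to widen coverage before training on the resulting labels.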
Created:
2026-02-10



