UzThemeLex Dataset: An Uzbek Thematic Lexicon for Domain Terminology and Weakly Supervised NER
Source: NIAID Data Ecosystem. Indexed: 2026-05-10
Download link:
https://data.mendeley.com/datasets/jgfkz6493p
Description:
UzThemeLex is a curated Uzbek-language thematic lexicon dataset designed for domain terminology mining and weakly supervised named entity recognition (NER). The release contains 4,945 unique terminological entries organized into 3 top-level domains (Agronomy, Economics and Business, Law and Governance) and 30 subcategories. Each entry provides the Uzbek term in Latin script, a normalized form for matching, a paraphrased Uzbek definition, domain and subcategory labels, provenance pointers to authoritative sources, and lightweight quality-control signals (heuristic confidence score, review flag, ambiguity flag). Optional fields include aliases and example sentences.
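The entry structure described above could be loaded and filtered along the following lines. This is a minimal sketch, not the dataset's actual schema: the column names (`term`, `normalized`, `confidence`, `needs_review`, `ambiguous`), the sample rows, the subcategory names, and the 0.8 threshold are all assumptions made for illustration. The real column names are documented in the release's data dictionary.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    """One lexicon row; field names are hypothetical, not the release schema."""
    term: str
    normalized: str
    domain: str
    subcategory: str
    confidence: float
    needs_review: bool
    ambiguous: bool


def parse_bool(value: str) -> bool:
    """Interpret common textual booleans found in CSV exports."""
    return value.strip().lower() in {"1", "true", "yes"}


def load_entries(handle) -> list[LexiconEntry]:
    """Read entries from any text file-like object containing CSV with a header."""
    return [
        LexiconEntry(
            term=row["term"],
            normalized=row["normalized"],
            domain=row["domain"],
            subcategory=row["subcategory"],
            confidence=float(row["confidence"]),
            needs_review=parse_bool(row["needs_review"]),
            ambiguous=parse_bool(row["ambiguous"]),
        )
        for row in csv.DictReader(handle)
    ]


def select_reliable(entries, min_confidence: float = 0.8) -> list[LexiconEntry]:
    """Keep entries that pass all three quality-control signals."""
    return [
        e for e in entries
        if e.confidence >= min_confidence and not e.needs_review and not e.ambiguous
    ]


# Made-up sample rows standing in for the real CSV file.
SAMPLE_CSV = """term,normalized,domain,subcategory,confidence,needs_review,ambiguous
Soliq stavkasi,soliq stavkasi,Economics and Business,Taxation,0.95,false,false
Sud,sud,Law and Governance,Courts,0.60,true,true
"""

entries = load_entries(io.StringIO(SAMPLE_CSV))
reliable = select_reliable(entries)
```

Filtering on the quality-control signals before building a gazetteer is one way to trade coverage for precision; whether 0.8 is a sensible cutoff depends on how the heuristic score is distributed in the actual release.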
The dataset is distributed in multiple formats to support both manual inspection and machine processing. It includes a flat CSV file and a multi-sheet Excel workbook, together with a data dictionary that documents all columns and label sets. For training and pipeline integration, the release also provides JSON/JSONL exports, taxonomy metadata, and ready-to-use pattern files for dictionary-based tagging and weak supervision (e.g., spaCy EntityRuler patterns). A validation script is included to help users verify schema consistency and detect formatting issues (e.g., residual Cyrillic characters and inconsistent apostrophe forms).
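The kinds of checks mentioned above (residual Cyrillic characters, apostrophe handling) can be illustrated with a short sketch. It is not the validation script shipped with the release. The choice to map all apostrophe variants to the ASCII apostrophe, the function names, and the `ECON` label are assumptions. The pattern dictionary follows the `{"label": ..., "pattern": ...}` phrase-pattern shape that spaCy's EntityRuler accepts; the release's own pattern files may be structured differently.

```python
import re
import unicodedata

# Basic Cyrillic block; a residual Cyrillic letter in a Latin-script term
# usually signals a transliteration or copy-paste problem.
CYRILLIC_RE = re.compile(r"[\u0400-\u04FF]")

# Characters often used for the Uzbek Latin apostrophe-like marks
# (curly quotes, modifier letters, backtick, acute accent).
APOSTROPHE_VARIANTS = "\u2018\u2019\u02BB\u02BC\u0060\u00B4"
_APOSTROPHE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})


def normalize_term(text: str) -> str:
    """Assumed normalization: NFC, ASCII apostrophe, lowercase, single spaces."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_APOSTROPHE_TABLE)
    return " ".join(text.lower().split())


def find_cyrillic(text: str) -> list[str]:
    """Return every Cyrillic character found in the text, in order."""
    return CYRILLIC_RE.findall(text)


def to_ruler_pattern(term: str, label: str) -> dict:
    """Build one phrase pattern in the dictionary shape used by EntityRuler."""
    return {"label": label, "pattern": normalize_term(term)}


# "G\u02BBo\u02BBza" uses U+02BB; the second string hides a Cyrillic "\u0441".
clean = normalize_term("G\u02BBo\u02BBza  hosili")
suspicious = find_cyrillic("soliq \u0441tavkasi")
pattern = to_ruler_pattern("Soliq  Stavkasi", "ECON")
```

String patterns are matched against tokenized text, so the input text would need the same normalization as the terms for the patterns to fire consistently.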
UzThemeLex can be used as (i) a domain dictionary for keyword-based classification and information extraction in Uzbek texts and (ii) a gazetteer for generating weak labels to train or fine-tune NER models. The resource is intended to support Uzbek NLP research and applied text analytics in agriculture, economics, and legal/governance domains.
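Use (ii) can be sketched as a longest-match dictionary tagger that emits BIO tags. This is an illustrative baseline under stated assumptions, not the procedure used by the dataset authors: tokenization is plain whitespace splitting, matching is case-insensitive on exact token sequences, and the terms and labels (`ECON`, `LAW`) are invented for the example rather than taken from the lexicon.

```python
def build_gazetteer(entries):
    """Map each term's token tuple to its label; entries are (term, label) pairs."""
    return {tuple(term.lower().split()): label for term, label in entries}


def weak_bio_tags(tokens, gazetteer):
    """Tag tokens with BIO labels, preferring the longest match at each position."""
    max_len = max((len(key) for key in gazetteer), default=0)
    lowered = [tok.lower() for tok in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        step = 1
        # Try the longest candidate span first so multiword terms win
        # over their single-word prefixes.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(lowered[i:i + n]))
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                step = n
                break
        i += step
    return tags


# Hypothetical entries and sentence, for illustration only.
gazetteer = build_gazetteer([
    ("soliq stavkasi", "ECON"),
    ("soliq", "ECON"),
    ("sud", "LAW"),
])
tokens = "Yangi soliq stavkasi sud qarori".split()
tags = weak_bio_tags(tokens, gazetteer)
```

Exact token matching will miss inflected forms, which matters for an agglutinative language like Uzbek; a practical pipeline would likely add stemming or suffix handling and would use the ambiguity flag to exclude terms that produce noisy labels.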
Created:
2026-02-10



