UzbekPOS: Multi-domain Part-Of-Speech Dataset for the Uzbek Language
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/55f889ncnx
Resource description:
UzbekPOS is a manually annotated, multi-domain Part-of-Speech (POS) tagging dataset for the Uzbek language, created to support research and development in Natural Language Processing (NLP), computational linguistics, and corpus linguistics. Uzbek is a morphologically rich and under-resourced Turkic language, and this dataset addresses the lack of large-scale, high-quality annotated resources for fundamental linguistic tasks.
The dataset contains 4,412 sentences and 53,113 token–tag pairs, collected from 25 diverse domains, including literature, news, science, education, law, medicine, technology, social interaction, and public discourse. This wide domain coverage provides linguistic, stylistic, and topical diversity, making the corpus suitable for both academic research and applied NLP systems.
All sentences were manually tokenized and POS-tagged by expert annotators using a carefully designed tagset based on the Universal Dependencies (UD) UPOS framework, with adaptations for Uzbek-specific grammatical features. The standard DET (Determiner) tag was omitted due to the absence of articles in Uzbek, and a language-specific MOD (Modal) tag was introduced to better capture Uzbek functional grammar. The final tagset consists of 16 POS tags.
To guarantee high annotation quality, each sentence was processed through a three-stage validation pipeline: initial annotation, independent cross-verification by a second expert, and final adjudication by a senior linguist in cases of disagreement. This process ensures the dataset represents a gold-standard POS resource.
The UzbekPOS dataset is distributed in multiple widely used formats to maximize accessibility and reuse:
Raw annotated text (.txt) with token/tag structure
Tab-Separated Values (.tsv) for easy inspection and processing
JSON Lines (.jsonl) for scalable programmatic use
CoNLL-U (.conllu) format, fully compatible with UD-based NLP tools
In addition, predefined train, development, and test splits are included to support standardized benchmarking and reproducible experiments.
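As a minimal sketch of how the CoNLL-U release might be consumed, the snippet below reads standard 10-column CoNLL-U text and collects (form, UPOS) pairs per sentence. The function name `read_conllu` and the two sample sentences are illustrative assumptions: the sample was written for this example and is not taken from UzbekPOS, and the actual file names and split layout of the dataset are not assumed here.

```python
from collections import Counter


def read_conllu(text):
    """Yield each sentence as a list of (form, upos) pairs.

    Assumes standard CoNLL-U: 10 tab-separated columns per token line,
    '#' comment lines, and a blank line between sentences.
    """
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level metadata such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token_id, form, upos = cols[0], cols[1], cols[3]
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        sentence.append((form, upos))
    if sentence:
        yield sentence


# Hypothetical sample written for this sketch, NOT taken from the dataset.
SAMPLE = (
    "# text = Men kitob o'qidim .\n"
    "1\tMen\t_\tPRON\t_\t_\t_\t_\t_\t_\n"
    "2\tkitob\t_\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "3\to'qidim\t_\tVERB\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t_\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# text = U keldi .\n"
    "1\tU\t_\tPRON\t_\t_\t_\t_\t_\t_\n"
    "2\tkeldi\t_\tVERB\t_\t_\t_\t_\t_\t_\n"
    "3\t.\t_\tPUNCT\t_\t_\t_\t_\t_\t_\n"
)

sentences = list(read_conllu(SAMPLE))
tag_counts = Counter(upos for sent in sentences for _, upos in sent)
print(len(sentences), "sentences;", sum(tag_counts.values()), "tokens")
print(tag_counts.most_common())
```

To use it on a downloaded file, pass the contents of a `.conllu` file read as UTF-8 text to `read_conllu`. The same tag counts can then be compared across the train, development, and test splits.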
UzbekPOS can be used for:
Training and evaluating POS taggers for Uzbek
Morphological and syntactic analysis
Cross-lingual and typological studies of Turkic languages
Transfer learning and low-resource NLP research
Educational purposes in NLP and corpus linguistics
This dataset is one of the largest openly available POS-tagged corpora for Uzbek and provides a solid foundation for future Uzbek and Turkic language technology development.
Created:
2026-01-02



