
UzbekPOS: Multi-domain Part-Of-Speech Dataset for the Uzbek Language

Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/55f889ncnx
Description:
UzbekPOS is a manually annotated, multi-domain Part-of-Speech (POS) tagging dataset for the Uzbek language, created to support research and development in Natural Language Processing (NLP), computational linguistics, and corpus linguistics. Uzbek is a morphologically rich and under-resourced Turkic language, and this dataset addresses the lack of large-scale, high-quality annotated resources for fundamental linguistic tasks.

The dataset contains 4,412 sentences and 53,113 token–tag pairs, collected from 25 diverse domains, including literature, news, science, education, law, medicine, technology, social interaction, and public discourse. This wide domain coverage ensures linguistic, stylistic, and topical diversity, making the corpus suitable for both academic research and applied NLP systems.

All sentences were manually tokenized and POS-tagged by expert annotators using a carefully designed tagset based on the Universal Dependencies (UD) UPOS framework, with adaptations for Uzbek-specific grammatical features. The standard DET (Determiner) tag was omitted due to the absence of articles in Uzbek, and a language-specific MOD (Modal) tag was introduced to better capture Uzbek functional grammar. The final tagset consists of 16 POS tags.

To guarantee high annotation quality, each sentence was processed through a three-stage validation pipeline: initial annotation, independent cross-verification by a second expert, and final adjudication by a senior linguist in cases of disagreement. This process ensures the dataset represents a gold-standard POS resource.
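The cross-verification step of the validation pipeline can be illustrated with a minimal Python sketch. It is not the authors' tooling: the function name `find_disagreements` and the sample sentence are hypothetical, and the only assumption is that each annotation pass produces a list of (token, tag) pairs over the same tokens. Positions where the two passes differ are the ones that would go to adjudication.

```python
def find_disagreements(first_pass, second_pass):
    """Return (index, token, tag_a, tag_b) for every token tagged differently.

    Both arguments are lists of (token, tag) pairs for the same sentence.
    Hypothetical helper for illustration; not part of the UzbekPOS release.
    """
    if len(first_pass) != len(second_pass):
        raise ValueError("annotation passes differ in length")
    disagreements = []
    for i, ((tok_a, tag_a), (tok_b, tag_b)) in enumerate(zip(first_pass, second_pass)):
        if tok_a != tok_b:
            raise ValueError(f"token mismatch at position {i}: {tok_a!r} vs {tok_b!r}")
        if tag_a != tag_b:
            disagreements.append((i, tok_a, tag_a, tag_b))
    return disagreements


# Illustrative sentence only; these annotations are not taken from the dataset.
first = [("Men", "PRON"), ("kitob", "NOUN"), ("o'qidim", "VERB")]
second = [("Men", "PRON"), ("kitob", "PROPN"), ("o'qidim", "VERB")]

to_adjudicate = find_disagreements(first, second)
```

Here `to_adjudicate` holds a single entry for the token `kitob`, the only position where the two hypothetical passes disagree.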
The UzbekPOS dataset is distributed in multiple widely used formats to maximize accessibility and reuse:

- Raw annotated text (.txt) with a slash-delimited (/) structure
- Tab-Separated Values (.tsv) for easy inspection and processing
- JSON Lines (.jsonl) for scalable programmatic use
- CoNLL-U (.conllu) format, fully compatible with UD-based NLP tools

In addition, predefined train, development, and test splits are included to support standardized benchmarking and reproducible experiments.

UzbekPOS can be used for:

- Training and evaluating POS taggers for Uzbek
- Morphological and syntactic analysis
- Cross-lingual and typological studies of Turkic languages
- Transfer learning and low-resource NLP research
- Educational purposes in NLP and corpus linguistics

This dataset is one of the largest openly available POS-tagged corpora for Uzbek and provides a solid foundation for future Uzbek and Turkic language technology development.
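As a rough illustration of working with the CoNLL-U distribution, the Python sketch below reads token and UPOS columns and counts tags. It relies only on the standard CoNLL-U layout (ten tab-separated columns, `#` comment lines, a blank line between sentences). The sample text is a hypothetical sentence, not an excerpt from UzbekPOS, and which columns beyond FORM and UPOS the dataset actually fills in is an assumption left open here.

```python
from collections import Counter

# Hypothetical sample in CoNLL-U layout; NOT taken from the UzbekPOS files.
SAMPLE = (
    "# text = Men kitob o'qidim.\n"
    "1\tMen\t_\tPRON\t_\t_\t_\t_\t_\t_\n"
    "2\tkitob\t_\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "3\to'qidim\t_\tVERB\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t_\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
)


def read_conllu(text):
    """Yield each sentence as a list of (form, upos) pairs."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not cols[0].isdigit():
            continue
        sentence.append((cols[1], cols[3]))  # FORM and UPOS columns
    if sentence:
        yield sentence


sentences = list(read_conllu(SAMPLE))
tag_counts = Counter(tag for sent in sentences for _, tag in sent)
```

To apply the same function to a real file, the text could be loaded with `open(path, encoding="utf-8").read()`, where `path` points at one of the distributed .conllu splits.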
Created:
2026-01-02