
Dataset of Uzbek language NER (3000+)

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://data.mendeley.com/datasets/7d59mk8xp5
Description:
As part of the study, an annotated corpus of the Uzbek language was created for training and evaluating named entity recognition (NER) models. The corpus contains 3,053 sentences (34,911 words) collected from three kinds of sources:
• Legislative acts and legal documents: Most of the data was extracted from the publicly available lex.uz database, whose official texts are carefully edited and formally structured.
• News sites: Articles and other materials from the Uzbek news portals kun.uz and gazeta.uz were used, which brought in contemporary language structures and current vocabulary.
• Manually created sentences: To increase the number of named entities per sentence and ensure diversity, the authors wrote additional sentences that each contain several entities of different types. This enriched the corpus with complex structures and made model training more effective.
The data was annotated manually using the BIOES scheme, which marks the boundaries and types of named entities in detail. All annotations were reviewed by Uzbek language experts to ensure accuracy and consistency.
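The BIOES scheme mentioned above tags each token as B (beginning), I (inside), E (end), S (single-token entity), or O (outside any entity). The following is a minimal sketch of how entity spans map to BIOES tags. The example sentence, the entity types `LOC` and `ORG`, and the helper name `spans_to_bioes` are illustrative assumptions; they are not taken from the corpus, whose actual label set and file format are not stated here.

```python
from typing import List, Tuple


def spans_to_bioes(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans (start, end_exclusive, type) into BIOES tags.

    Hypothetical helper for illustration; spans are assumed not to overlap.
    """
    tags = ["O"] * n_tokens  # every token starts outside any entity
    for start, end, etype in spans:
        if not (0 <= start < end <= n_tokens):
            raise ValueError(f"invalid span: ({start}, {end})")
        if end - start == 1:
            # A single-token entity gets the S- prefix.
            tags[start] = f"S-{etype}"
        else:
            # A multi-token entity is B- ... I- ... E-.
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"
            tags[end - 1] = f"E-{etype}"
    return tags


# Illustrative sentence (not from the dataset); entity types are assumed.
tokens = ["Toshkent", "shahrida", "O'zbekiston", "Milliy", "universiteti", "joylashgan"]
spans = [(0, 1, "LOC"), (2, 5, "ORG")]

tags = spans_to_bioes(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Compared with the simpler BIO scheme, the explicit E and S tags make entity boundaries unambiguous for each token on its own, which is the "detailed marking of boundaries" the description refers to.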
Created:
2025-03-18