Dataset of Named Entity Recognition for Uzbek language
Source: doi.org. Indexed 2025-01-15.
Download link:
http://doi.org/10.17632/xf7pyvhb2v.1
Description:
As part of the study, an annotated corpus of the Uzbek language was created for training and evaluating named entity recognition models. The corpus includes 2,000 sentences (25,865 words) collected from the following sources:
• Part of the data (sentences 1 to 154 in the dataset) was extracted from the publicly available lex.uz database, which contains official texts written in formal, carefully edited language.
• To increase the number of named entities per sentence and ensure diversity, the authors wrote additional sentences, each containing several entities of different types. This enriched the corpus with complex structures and improved the efficiency of model training.
Data annotation was carried out manually using the BIOES scheme, which marks both the boundaries and the types of named entities in detail. All annotations were reviewed by Uzbek language experts to ensure the accuracy and consistency of the data.
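For readers unfamiliar with BIOES, the following Python sketch illustrates how the scheme encodes entity boundaries: B (begin), I (inside), E (end), S (single-token entity), and O (outside). It is a minimal illustration, not the authors' tooling. The entity types `PER` and `LOC`, the example sentence, and the function names are assumptions made for this example; the listing does not state the dataset's actual label set or file format.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def spans_to_bioes(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping entity spans as one BIOES tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # A one-token entity gets the S (single) tag.
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


def bioes_to_spans(tags: List[str]) -> List[Span]:
    """Decode a well-formed BIOES tag sequence back into entity spans."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            spans.append((start, i + 1, label))
            start = None
        elif prefix == "O":
            start = None
        # An I tag simply continues the entity opened by B.
    return spans


# Hypothetical example sentence and labels (not taken from the dataset).
tokens = ["Alisher", "Navoiy", "Hirotda", "tug'ilgan"]
tags = spans_to_bioes(len(tokens), [(0, 2, "PER"), (2, 3, "LOC")])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Compared with plain BIO tagging, the explicit E and S tags make the end of each entity visible at the token level, which is what the description means by detailed boundary marking.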
Provider:
Mendeley Data



