TibNER: Tibetan Named Entity Recognition Dataset
Source: DataCite Commons. Updated 2025-04-27; indexed 2025-04-16.
Download link:
https://www.scidb.cn/detail?dataSetId=0cb7427b4933474f817fc028cc038af0
Description:
Structured linguistic resources are an important foundation for natural language processing. Because open-source datasets are scarce, research on Tibetan named entity recognition has progressed slowly and few results have accumulated. To address this, this paper semi-automatically constructs a Tibetan named entity recognition dataset (TibNER) using an entity dictionary. To ensure the quality of the dataset, the automatic annotation results are manually proofread. TibNER contains 20,096 sentences with an average sentence length of 44.2069 syllables. The annotated entities cover three types (names of people, places, and organizations), with 43,678 entities in total. To validate the dataset, this paper runs comparative experiments on three types of mainstream sequence labeling models, reaching an F1 score of up to 80.60%. The work offers dataset-construction experience for low-resource languages and provides a data foundation for research such as Tibetan named entity recognition.
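The description mentions dictionary-based automatic annotation followed by manual proofreading, but does not give the procedure. The following is only a minimal sketch of how such a first pass might look, not the authors' method. It assumes that syllables are split on the Tibetan tsheg mark (U+0F0B), that tags follow a BIO scheme, and that the labels are named PER, LOC, and ORG. The function names and the one-entry dictionary are hypothetical.

```python
TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)


def split_syllables(sentence):
    """Split a Tibetan string into syllables on the tsheg mark."""
    return [s for s in sentence.split(TSHEG) if s]


def annotate(syllables, entity_dict, max_len=8):
    """Assign BIO tags by greedy longest-match lookup in an entity dictionary.

    entity_dict maps a tuple of syllables to a label (assumed: PER, LOC, ORG).
    max_len caps the entity length (in syllables) tried at each position.
    """
    tags = ["O"] * len(syllables)
    i = 0
    while i < len(syllables):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            label = entity_dict.get(tuple(syllables[i:i + n]))
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Hypothetical one-entry dictionary: the place name Lhasa (two syllables).
entity_dict = {tuple(split_syllables("ལྷ་ས")): "LOC"}

syllables = split_syllables("ང་ལྷ་ས་ལ་འགྲོ")
tags = annotate(syllables, entity_dict)
for syllable, tag in zip(syllables, tags):
    print(syllable, tag)
```

Pure dictionary matching mislabels ambiguous strings and misses entities that are not in the dictionary, which is presumably why the automatic output is proofread by hand.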
Provider:
Science Data Bank
Created:
2024-05-24
Collected summary
Dataset introduction

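Background and challenges
Background overview
TibNER is a named entity recognition dataset built for Tibetan, a low-resource language. It was created with a semi-automatic method based on an entity dictionary and then manually proofread. It contains 20,096 sentences and 43,678 annotated person, place, and organization entities. The dataset was validated on mainstream sequence labeling models, with a best F1 score of 80.60%, and provides an important data foundation for Tibetan natural language processing research.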
The summary above was collected and generated by 遇见数据集.



