
TibNER: Tibetan Named Entity Recognition Dataset

Source: DataCite Commons. Updated: 2025-04-27. Indexed: 2025-04-16.
Download link:
https://www.scidb.cn/detail?dataSetId=0cb7427b4933474f817fc028cc038af0
Resource description:
Structured linguistic resources are an important foundation for natural language processing. Because open-source datasets are scarce, research on Tibetan named entity recognition has progressed slowly and produced few accumulated results. To address this, this paper semi-automatically constructs a Tibetan named entity recognition dataset (TibNER) using an entity dictionary. To ensure the quality of the dataset, the automatic annotation results are manually proofread. TibNER contains 20,096 sentences with an average sentence length of 44.2069 syllables. The annotated entities cover three types (person, place, and organization names), with 43,678 entities in total. To validate the dataset, this paper runs comparative experiments on three types of mainstream sequence labeling models, reaching an F1 score of up to 80.60%. The dataset offers data-construction experience for low-resource languages and a data foundation for research such as Tibetan named entity recognition.
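The dictionary-based automatic annotation step described above might look roughly like the following. This is a minimal sketch, not the authors' pipeline: the greedy longest-match strategy, the BIO tag scheme, the label names (`PER`, `LOC`, `ORG`), and the helper names `split_syllables` and `annotate` are all assumptions made for illustration. Syllables are split on the Tibetan tsheg mark (U+0F0B), which matches the description's use of syllables as the unit of sentence length.

```python
from typing import Dict, List, Optional, Tuple

TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)

# Hypothetical dictionary type: a tuple of syllables maps to an entity label.
EntityDict = Dict[Tuple[str, ...], str]


def split_syllables(text: str) -> List[str]:
    """Split a Tibetan string into syllables on the tsheg mark."""
    return [s for s in text.split(TSHEG) if s]


def annotate(syllables: List[str], entity_dict: EntityDict,
             max_len: Optional[int] = None) -> List[str]:
    """Tag syllables with BIO labels using greedy longest-match lookup."""
    if max_len is None:
        # Longest dictionary entry, measured in syllables.
        max_len = max((len(k) for k in entity_dict), default=0)
    tags = ["O"] * len(syllables)
    i = 0
    while i < len(syllables):
        matched = False
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            label = entity_dict.get(tuple(syllables[i:i + n]))
            if label is not None:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


if __name__ == "__main__":
    # Toy dictionary and sentence, for illustration only (not from TibNER).
    toy_dict: EntityDict = {("ལྷ", "ས"): "LOC"}  # Lhasa
    sentence = TSHEG.join(["ང", "ལྷ", "ས", "ལ", "འགྲོ"])
    syllables = split_syllables(sentence)
    for syllable, tag in zip(syllables, annotate(syllables, toy_dict)):
        print(syllable, tag)
```

Output of this kind is only a first pass. Dictionary matching cannot resolve ambiguous or unseen names, which is consistent with the description's note that the automatic annotations were manually proofread.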
Provider:
Science Data Bank
Created:
2024-05-24
Collected summary
Dataset introduction
Background and challenges
Background overview
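TibNER is a named entity recognition dataset built for Tibetan, a low-resource language. It was created with a semi-automatic, entity-dictionary-based method and then manually proofread. It contains 20,096 sentences and 43,678 annotated person, place, and organization name entities. The dataset was validated on mainstream sequence labeling models, with a highest F1 score of 80.60%, and provides an important data foundation for Tibetan natural language processing research.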
The above content was collected and summarized by 遇见数据集.