Cross Script Hindi-English NER Corpus (Hi-En-WP)
Source: arXiv. Updated: 2018-10-08. Indexed: 2024-08-06.
Download link:
http://arxiv.org/abs/1810.03430v1
Resource description:
The Cross Script Hindi-English NER Corpus (Hi-En-WP) was developed by the Department of Computer Engineering, Jamia Millia Islamia, New Delhi, India. It contains 2,916 records, extracted primarily from Wikipedia category pages, and targets code-mixed named entity recognition (NER) in the Indian context. During construction, candidate named entities were automatically extracted and annotated by parsing the text of Wikipedia links. Intended applications include social media content analysis; the goal is to address NER in code-mixed text, particularly the challenges specific to the Indian linguistic environment.
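The link-parsing step mentioned above might look roughly like the following minimal sketch. It assumes raw wikitext as input and treats the visible text of each internal link as a candidate entity. The function name `candidate_entities`, the namespace filter, and the sample sentence are all illustrative assumptions and are not taken from the dataset's actual pipeline.

```python
import re

# Matches internal wikitext links: [[Target]] and [[Target|anchor text]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def candidate_entities(wikitext):
    """Return the visible text of each internal link as a candidate entity.

    Hypothetical helper: the real corpus construction may filter and
    label candidates differently.
    """
    candidates = []
    for target, anchor in WIKILINK.findall(wikitext):
        # Skip namespace links such as Category: or File: (assumed filter).
        if ":" in target:
            continue
        # Use the anchor text when present, otherwise the link target.
        text = (anchor or target).strip()
        if text:
            candidates.append(text)
    return candidates


# Made-up code-mixed sample, for illustration only.
sample = "[[Shah Rukh Khan|SRK]] ki nayi film [[Mumbai]] mein release hui. [[Category:Films]]"
print(candidate_entities(sample))
```

A regular expression is enough for a sketch like this, but nested templates and other wikitext edge cases would likely need a dedicated wikitext parser.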
Provided by:
Department of Computer Engineering, Jamia Millia Islamia, New Delhi, India
Created:
2018-10-08



