UNER English Corpus
Source: arXiv. Updated: 2022-12-14. Indexed: 2024-08-06.
Download link:
http://arxiv.org/abs/2212.07162v1
Description:
The UNER English Corpus is an English dataset for Universal Named Entity Recognition (UNER), created by the Faculty of Humanities and Social Sciences, University of Zagreb. The corpus was generated automatically from Wikipedia text and metadata combined with DBpedia information, and then evaluated. It contains 3.3 GB of text in 17,150 files across 172 folders, and 8.9% of its tokens are annotated as entities. The creation process has three steps: extract text and metadata from Wikipedia, use the hyperlinks in that text to look up the corresponding DBpedia classes, and convert those classes into UNER types and subtypes. The dataset is intended mainly for multilingual natural language processing work on identifying and classifying named entities.
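The annotation step described above (hyperlink, then DBpedia class, then UNER type) can be illustrated with a minimal sketch. It is not the authors' pipeline: the two lookup tables, the tag strings such as `Name-Person`, and the `annotate` function are hypothetical stand-ins, and the input is assumed to be whitespace-tokenized wikitext with `[[Target|surface]]` links.

```python
import re

# Hypothetical lookup tables for illustration only. The real corpus derives
# the first from DBpedia and the second from the UNER type hierarchy.
DBPEDIA_CLASS = {"Zagreb": "dbo:City", "Nikola_Tesla": "dbo:Scientist"}
UNER_TYPE = {"dbo:City": ("Location", "City"), "dbo:Scientist": ("Name", "Person")}

# Matches [[Target]] and [[Target|surface text]] wikitext links.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def annotate(wikitext):
    """Return (token, tag) pairs; linked spans get a type-subtype tag, the rest 'O'."""
    pairs = []
    pos = 0
    for m in LINK.finditer(wikitext):
        # Plain text before the link is outside any entity.
        pairs += [(tok, "O") for tok in wikitext[pos:m.start()].split()]
        target = m.group(1).strip().replace(" ", "_")
        surface = m.group(2) or m.group(1)
        # Link target -> DBpedia class -> (type, subtype); unknown targets stay 'O'.
        uner = UNER_TYPE.get(DBPEDIA_CLASS.get(target))
        tag = "-".join(uner) if uner else "O"
        pairs += [(tok, tag) for tok in surface.split()]
        pos = m.end()
    pairs += [(tok, "O") for tok in wikitext[pos:].split()]
    return pairs


def entity_share(pairs):
    """Fraction of tokens carrying an entity tag (the corpus reports 8.9%)."""
    return sum(tag != "O" for _, tag in pairs) / len(pairs) if pairs else 0.0
```

For example, `annotate("[[Nikola Tesla|Tesla]] lived in [[Zagreb]] .")` tags `Tesla` and `Zagreb` and leaves the other three tokens as `O`. Links whose target has no entry in the lookup tables are left untagged rather than guessed.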
Provided by:
Faculty of Humanities and Social Sciences, University of Zagreb
Created:
2022-12-14



