English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset
Source: Mendeley Data | Updated: 2024-01-31 | Indexed: 2024-06-26
Download link:
https://data.mendeley.com/datasets/cdcztymf4k
Description:
TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization. First, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1000 fine-grained entity types for both languages. The Turkish gazetteers contain approximately 300K named entities and the English gazetteers contain approximately 23M named entities.

By leveraging the large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies: (a) domain-dependent and (b) domain-independent. Post-processing the raw collections with each methodology produces two further versions, so TWNERTC and EWNERTC are each released in three versions: (a) raw, (b) domain-dependent post-processed, and (c) domain-independent post-processed. The Turkish collections have approximately 700K sentences per version (the exact count varies between versions), while the English collections contain more than 7M sentences.

We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce the fine-grained types to "organization", "person", "location" and "misc" by mapping each fine-grained type to the most similar coarse-grained type. Note that this process also eliminates many domains and fine-grained annotations due to lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is lower than in the "Fine-Grained NER" versions.

All processes are explained in our published white paper for Turkish; however, the major methods (gazetteer creation, automatic categorization/annotation, noise reduction) do not change for English.
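
The fine-to-coarse reduction described above can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the dataset files: the tag strings in `FINE_TO_COARSE`, the `"O"` tag for tokens outside any entity, and the policy of dropping a whole sentence when one of its types has no coarse-grained counterpart. The description only states that the mapping eliminates many domains and fine-grained annotations.

```python
from typing import List, Optional

# Hypothetical fine-to-coarse table. The real gazetteers have more than
# 1000 fine-grained types; these four entries are illustrative only.
FINE_TO_COARSE = {
    "people/person": "person",
    "location/citytown": "location",
    "organization/organization": "organization",
    "film/film": "misc",
}

OUTSIDE = "O"  # assumed tag for tokens that are not part of an entity


def to_coarse(tags: List[str]) -> Optional[List[str]]:
    """Map per-token fine-grained tags to coarse-grained tags.

    Returns None when any entity tag has no coarse-grained mapping,
    which models (as an assumption) a sentence being dropped.
    """
    coarse = []
    for tag in tags:
        if tag == OUTSIDE:
            coarse.append(OUTSIDE)
            continue
        mapped = FINE_TO_COARSE.get(tag)
        if mapped is None:
            return None  # unmappable type: discard the sentence
        coarse.append(mapped)
    return coarse


def reduce_corpus(sentences: List[List[str]]) -> List[List[str]]:
    """Keep only the sentences whose tags all map to a coarse type."""
    reduced = []
    for tags in sentences:
        coarse = to_coarse(tags)
        if coarse is not None:
            reduced.append(coarse)
    return reduced
```

Under these assumptions, a corpus passed through `reduce_corpus` can only shrink or stay the same size, which matches the statement that the "Coarse-Grained NER" versions contain fewer sentences than the fine-grained ones.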
Created:
2024-01-31



