TWNERTC
Source: arXiv. Updated: 2017-02-09. Indexed: 2024-06-21.
Download link:
http://dx.doi.org/10.17632/cdcztymf4k.1
Dataset description:
The TWNERTC dataset was created by the Huawei Turkey R&D Center to support research on Turkish named entity recognition (NER) and text classification (TC). It contains sentences drawn from Wikipedia that were categorized and annotated automatically, approximately 700,000 in total (696,785 as listed). To build it, a graph crawler algorithm extracted relevant entity and domain information from Freebase, producing a large-scale gazetteer of about 300,000 entities. Two new content-specific noise reduction methods were introduced during construction, and the fine-grained entity types were mapped to four coarse-grained types: Person, Location, Organization, and Other. The dataset is intended to make Turkish NER and TC more automated and to avoid the difficulty of building such datasets by hand.
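The fine-to-coarse type mapping mentioned in the description can be sketched roughly as follows. This is only an illustration under stated assumptions: the fine-grained type names, the `FINE_TO_COARSE` table, and the helper functions are hypothetical and are not taken from the dataset's actual label set or tooling.

```python
# Hypothetical fine-grained type names; the dataset's real labels may differ.
FINE_TO_COARSE = {
    "person": "PERSON",
    "politician": "PERSON",
    "city": "LOCATION",
    "country": "LOCATION",
    "company": "ORGANIZATION",
    "university": "ORGANIZATION",
}


def to_coarse(fine_type: str) -> str:
    """Map a fine-grained type to one of four coarse types.

    Any type not listed in the table falls back to OTHER.
    """
    return FINE_TO_COARSE.get(fine_type.lower(), "OTHER")


def coarsen(tagged_tokens):
    """Replace fine-grained tags with coarse ones, leaving 'O' (no entity) as is."""
    return [
        (token, tag if tag == "O" else to_coarse(tag))
        for token, tag in tagged_tokens
    ]


if __name__ == "__main__":
    sentence = [("Ankara", "city"), ("güzel", "O"), ("Huawei", "company")]
    print(coarsen(sentence))
```

Falling back to OTHER for unlisted types keeps the table short, at the cost of silently absorbing any type that was simply forgotten; a stricter variant could raise an error instead.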
Provider:
Huawei Turkey R&D Center
Created:
2017-02-08



