Feature Enriched Dataset to Develop and Test Author Name Ambiguity Solutions

Name: Feature Enriched Dataset to Develop and Test Author Name Ambiguity Solutions
Creator: figshare
Published: 2021-02-13 08:59:31
License: No description available

Source: DataCite Commons. Updated 2021-02-13; indexed 2024-07-28.

Download link:

https://figshare.com/articles/dataset/Feature_Enriched_Dataset_to_Develop_and_Test_Author_Name_Ambiguity_Solutions/13651502/2


Description:

The author name ambiguity problem stems from the fact that many individuals share identical names in the real world. It is a non-trivial problem commonly faced by digital libraries, which index and retrieve scholarly articles written by distinct authors on demand. The problem occurs when multiple authors share a common name or when multiple name variations of one author appear in publication records, because digital libraries do not provide an efficient way to accurately identify the publications of an author. Furthermore, the lack of complete metadata about authors and publications hinders the development of a generic algorithm that can solve the problem. Many data collections exist that are used to develop and test author name disambiguation solutions; however, the majority of them lack basic author and publication metadata, are incomplete and inconsistent, and are often biased because they were curated with a particular scenario or feature in view. To the best of our knowledge, there is no free data collection that is enriched with useful features and has the potential to support comprehensive development and evaluation of author name disambiguation techniques. Considering this, we have prepared a dataset based on 14 ambiguous author name blocks (groups), where each block is composed of distinct authors who share a common name or multiple name variations. The dataset covers more than eleven author- and publication-related features and contains 7886 publication records in total, belonging to 137 distinct authors. The raw data was extracted from multiple web sources and then annotated and confirmed manually. After pre-processing the raw data, we employed a user-study-based evaluation method along with Cohen's Kappa coefficient, obtaining 100% agreement.
We believe that this dataset gives the research community an opportunity to work with a diverse, unbiased, and feature-enriched dataset for developing and testing author name disambiguation techniques that are generic and perform better when employed in real-world scenarios.
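
For readers unfamiliar with the agreement measure mentioned above, the following is a minimal sketch of how Cohen's Kappa can be computed for two annotators. It is not the authors' evaluation procedure, which the description does not detail; the function name `cohen_kappa` and the label values (`author_1`, `author_2`, ...) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    # Fraction of items on which the two annotators chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two annotators' marginal frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # formally undefined here, so this sketch treats it as full agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations: which author each publication record belongs to.
annotator_1 = ["author_1", "author_1", "author_2", "author_3"]
annotator_2 = ["author_1", "author_1", "author_2", "author_3"]
full_agreement = cohen_kappa(annotator_1, annotator_2)

# A second hypothetical pair in which the annotators disagree on one record.
partial_agreement = cohen_kappa(["x", "x", "y", "y"], ["x", "y", "y", "y"])
```

When the two annotators assign identical labels to every record, as in the first pair, the coefficient is 1, which corresponds to complete agreement; a single disagreement, as in the second pair, lowers it.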


Provider:

figshare

Created:

2021-01-28
