Data from: Challenges with using names to link digital biodiversity information
Source: DataCite Commons | Updated: 2025-06-01 | Indexed: 2025-06-15
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.3160r
Description:
The need for a names-based cyber-infrastructure for digital biology is
based on the argument that scientific names serve as a standardized
metadata system that has been used consistently and near universally for
250 years. As we move towards data-centric biology, name-strings can be
called on to discover, index, manage, and analyze accessible digital
biodiversity information from multiple sources. Known impediments to the
use of scientific names as metadata include synonyms, homonyms,
misspellings, and the use of other strings as identifiers. Here we
compare the name-strings in GenBank, Catalogue of Life (CoL), and the
Dryad Digital Repository (DRYAD) to assess the effectiveness of the
current names-management toolkit developed by Global Names to achieve
interoperability among distributed data sources. New tools that have been
used here include Parser (to break name-strings into component parts and
to promote the use of canonical versions of the names), a modified
TaxaMatch fuzzy-matcher (to help manage typographical, transliteration,
and OCR errors), and Cross-Mapper (to make comparisons among data sets).
The data sources include scientific names at multiple ranks; vernacular
(common) names; acronyms; strain identifiers; and other surrogates,
including idiosyncratic abbreviations and concatenations. About 40% of the
name-strings in GenBank are scientific names representing about 400,000
species or infraspecies and their synonyms. Of the formally named terminal
taxa (species and lower taxa) represented, about 82% have a match in CoL.
Using a subset of content in DRYAD, about 45% of the identifiers are names
of species and infraspecies, and of these only about a third have a match
in CoL. With simple processing, the extent of matching between DRYAD and
CoL can be improved to over 90%. The findings confirm the necessity for
name-processing tools and the value of scientific names as a mechanism to
interconnect distributed data, and identify specific areas of improvement
for taxonomic data sources. Some areas of diversity (bacteria and viruses)
are not well represented by conventional scientific names, and they and
other forms of strings (acronyms, identifiers, and other surrogates) that
are used instead of names need to be managed in reconciliation services
(mapping alternative name-strings for the same taxon together). Online
resolution services will bring older scientific names up to date or
convert surrogate name-strings to scientific names should such names
exist. Examples are given of many of the aberrant forms of ‘names’ that
make their way into these databases. The occurrence of scientific names
with incorrect authors, such as chresonyms within synonymy lists, is a
quality-control issue in need of attention. We propose a future-proofing
solution that will empower stakeholders to take advantage of the
name-based infrastructure at little cost. This proposed infrastructure
includes a standardized system that adopts or creates UUIDs for
name-strings, software that can identify name-strings in sources and apply
the UUIDs, reconciliation and resolution services to manage the
name-strings, and an annotation environment for quality control by users
of name-strings.
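
The workflow described above (parse a name-string to a canonical form, assign it a stable identifier, then cross-map two sources with exact and fuzzy matching) can be illustrated with a minimal Python sketch. This is not the Global Names Parser, TaxaMatch, or Cross-Mapper: the function names, the regex-based canonicalization, the use of difflib for fuzzy matching, the 0.85 cutoff, and the UUID namespace are all assumptions made for illustration only.

```python
import re
import uuid
from difflib import get_close_matches

# Hypothetical namespace for this sketch; not the namespace of any real service.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "names.example.org")


def canonical_form(name_string: str) -> str:
    """Crude canonical form: first word plus the lowercase epithets that follow it.

    Authorship, years, and rank markers end the scan, so infraspecific names
    written with "var." or "subsp." are truncated. Lowercase author particles
    such as "de" would be mistaken for epithets. A real parser handles these.
    """
    tokens = name_string.strip().split()
    if not tokens:
        return ""
    kept = [tokens[0]]
    for token in tokens[1:]:
        if re.fullmatch(r"[a-z][a-z-]*", token):
            kept.append(token)
        else:
            break
    return " ".join(kept)


def name_uuid(name_string: str) -> uuid.UUID:
    """Deterministic (version 5) UUID: the same string always gets the same ID."""
    return uuid.uuid5(NAMESPACE, name_string)


def cross_map(source_names, reference_names, cutoff=0.85):
    """Map each source name-string to a reference canonical name.

    Returns {raw_name: (match_type, matched_canonical_or_None)} where
    match_type is "exact", "fuzzy", or "none".
    """
    reference = sorted({canonical_form(n) for n in reference_names})
    reference_set = set(reference)
    result = {}
    for raw in source_names:
        canon = canonical_form(raw)
        if canon in reference_set:
            result[raw] = ("exact", canon)
            continue
        close = get_close_matches(canon, reference, n=1, cutoff=cutoff)
        result[raw] = ("fuzzy", close[0]) if close else ("none", None)
    return result


if __name__ == "__main__":
    reference = [
        "Homo sapiens L.",
        "Escherichia coli (Migula 1895) Castellani & Chalmers 1919",
    ]
    source = ["Homo sapiens Linnaeus, 1758", "Homo sapeins", "E. coli K-12"]
    for raw, (kind, match) in cross_map(source, reference).items():
        print(f"{raw!r}: {kind} -> {match} (id of raw string: {name_uuid(raw)})")
```

In this sketch a surrogate such as "E. coli K-12" finds no match, which is the kind of string the abstract says must be handled by a reconciliation service rather than by spelling-level fuzzy matching.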
Provider:
Dryad
Created:
2016-05-20



