Data from: A real data-driven simulation strategy to select an imputation method for mixed-type trait data
Source: DataCite Commons. Updated: 2026-03-12. Indexed: 2025-04-10.
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.crjdfn37m
Description:
Missing observations in trait datasets pose an obstacle for analyses in
myriad biological disciplines. Considering the mixed results of
imputation, the wide variety of available methods, and the varied
structure of real trait datasets, a framework for selecting a suitable
imputation method is advantageous. We invoked a real data-driven
simulation strategy to select an imputation method for a given mixed-type
(categorical, count, continuous) target dataset. Candidate methods
included mean/mode imputation, k-nearest neighbour, random
forests, and multivariate imputation by chained equations (MICE). Using a
trait dataset of squamates (lizards and amphisbaenians; order: Squamata)
as a target dataset, a complete-case dataset consisting of species with
nearly complete information was formed for the imputation method
selection. Missing data were induced by removing values from this dataset
under different missingness mechanisms: missing completely at random
(MCAR), missing at random (MAR), and missing not at random (MNAR). For
each method, combinations with and without phylogenetic information from
single-gene (nuclear and mitochondrial) or multigene trees were used to
impute the missing values for five numerical and two categorical traits.
The performance of each method was evaluated under each missingness
mechanism using the mean squared error for numerical traits and the
proportion falsely classified for categorical traits. A
random forest method supplemented with a nuclear-derived phylogeny
resulted in the lowest error rates for the majority of traits, and this
method was used to impute missing values in the original dataset. Data
with imputed values better reflected the characteristics and distributions
of the original data compared to complete-case data. However, caution
should be taken when imputing trait data as phylogeny did not always
improve performance for every trait and in every scenario. Ultimately,
these results support the use of a real data-driven simulation strategy
for selecting a suitable imputation method for a given mixed-type trait
dataset.
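
The mask, impute, and score loop described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy trait values, and the use of mean/mode imputation as the only candidate method are assumptions made for illustration, and phylogenetic information is left out entirely. Uniform weights give MCAR; weights taken from another trait approximate MAR; weights taken from the trait itself would approximate MNAR.

```python
import random
from statistics import mean, mode


def induce_missing(values, rate, rng, weights=None):
    """Return a copy of `values` with round(rate * n) entries set to None.

    weights=None removes entries uniformly at random (MCAR). Positive
    weights make high-weight entries more likely to be removed.
    """
    n = len(values)
    k = round(rate * n)
    if weights is None:
        weights = [1.0] * n
    # Weighted sampling without replacement: draw one key per entry and
    # drop the k entries with the largest keys (Efraimidis-Spirakis).
    keys = [rng.random() ** (1.0 / w) for w in weights]
    order = sorted(range(n), key=lambda i: keys[i], reverse=True)
    drop = set(order[:k])
    return [None if i in drop else v for i, v in enumerate(values)]


def impute(values, summary):
    """Fill None entries with summary(observed), e.g. mean or mode."""
    observed = [v for v in values if v is not None]
    fill = summary(observed)
    return [fill if v is None else v for v in values]


def mse(truth, masked, imputed):
    """Mean squared error over the entries that were masked."""
    errs = [(t - i) ** 2 for t, m, i in zip(truth, masked, imputed) if m is None]
    return sum(errs) / len(errs)


def pfc(truth, masked, imputed):
    """Proportion falsely classified over the entries that were masked."""
    wrong = [t != i for t, m, i in zip(truth, masked, imputed) if m is None]
    return sum(wrong) / len(wrong)


# Hypothetical toy traits, not values from the squamate dataset.
rng = random.Random(0)
body_mass = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
habitat = ["ground", "ground", "tree", "ground", "tree", "ground"]

# MCAR for the numerical trait; MAR-style for the categorical trait,
# where missingness in habitat depends on the observed body mass.
mass_masked = induce_missing(body_mass, 0.5, rng)
habitat_masked = induce_missing(habitat, 0.5, rng, weights=body_mass)

mass_mse = mse(body_mass, mass_masked, impute(mass_masked, mean))
habitat_pfc = pfc(habitat, habitat_masked, impute(habitat_masked, mode))
print("MSE (body mass):", mass_mse)
print("PFC (habitat):", habitat_pfc)
```

In a fuller comparison, the same masked copies would be passed to each candidate method (k-nearest neighbour, random forests, MICE), the masking would be repeated many times per mechanism, and the method with the lowest average error per trait would be chosen.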
Provider:
Dryad
Created:
2023-02-15



