A real data-driven simulation strategy to select an imputation method for mixed-type trait data

Source: DataONE. Updated 2023-02-15. Indexed 2025-08-02.

Download link:

https://search.dataone.org/view/sha256:89ba4841bc7da9ed0fd24f2d4698a6c936b29d7f3205e8e2364aada4d95004b9

Description:

Missing observations in trait datasets pose an obstacle for analyses in myriad biological disciplines. Considering the mixed results of imputation, the wide variety of available methods, and the varied structure of real trait datasets, a framework for selecting a suitable imputation method is advantageous. We invoked a real data-driven simulation strategy to select an imputation method for a given mixed-type (categorical, count, continuous) target dataset. Candidate methods included mean/mode imputation, k-nearest neighbour, random forests, and multivariate imputation by chained equations (MICE). Using a trait dataset of squamates (lizards and amphisbaenians; order: Squamata) as a target dataset, a complete-case dataset consisting of species with nearly completed information was formed for the imputation method selection. Missing data were induced by removing values from this dataset under different missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), an...

Cytochrome c oxidase subunit I (COI) sequence records were originally downloaded from The Barcode of Life Data System (BOLD) (Ratnasingham & Hebert, 2007) (sequence data available at dx.doi.org/10.5883/DS-IMPMIX2). Data were filtered for records that have been identified to the species level. Additional quality control checks on the sequence data included trimming N and gap content from sequence ends and removing sequences with greater than 1% of internal N and/or gap content across their entire sequence length. The AlignTranslation function from the R package 'DECIPHER' v. 2.18.1 (Wright, 2015, 2020) was used to perform a multiple sequence alignment on the COI sequences. Phylogenetic trees were built using RAxML v. 8 (Stamatakis, 2014). The model GTRGAMMAI was specified (option -m), and the alignments were partitioned based on codon position (option -q).
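
The sequence quality-control step described above (trimming N and gap content from sequence ends, then discarding sequences with more than 1% internal N and/or gap content) can be illustrated with a minimal Python sketch. This is not the authors' code, which was written in R; the function names are hypothetical, and it assumes that gaps are written as `-` and that the 1% threshold is measured against the trimmed sequence length.

```python
# Illustrative sketch only; names and the choice of denominator are assumptions.
AMBIGUOUS_CHARS = "N-"  # assumed symbols for unknown bases and gaps


def trim_ends(seq: str) -> str:
    """Remove N and gap characters from both ends of a sequence."""
    return seq.upper().strip(AMBIGUOUS_CHARS)


def internal_ambiguity_fraction(seq: str) -> float:
    """Fraction of N/gap characters remaining after end trimming."""
    trimmed = trim_ends(seq)
    if not trimmed:
        return 1.0  # nothing but N/gap content: treat as fully ambiguous
    n_ambiguous = sum(trimmed.count(ch) for ch in AMBIGUOUS_CHARS)
    return n_ambiguous / len(trimmed)


def passes_qc(seq: str, max_fraction: float = 0.01) -> bool:
    """Keep a sequence unless its internal N/gap content exceeds the threshold."""
    return internal_ambiguity_fraction(seq) <= max_fraction
```

A sequence would be retained by filtering on `passes_qc(seq)` and then stored in its trimmed form via `trim_ends(seq)`.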
Nuclear sequence data used for building the c-mos, RAG1 and multigene trees were obtained from a multigene alignment published in...

Alignment and phylogenetic trees may be opened and visualized by software capable of handling Newick and FASTA file formats.
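
For readers without dedicated alignment software, FASTA files are simple enough to inspect with a few lines of code. The sketch below is a minimal pure-Python reader, not part of the dataset; the function names are hypothetical, and it assumes each record starts with a `>` header line followed by one or more sequence lines.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its full sequence."""
    parts: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            parts[header] = []
        elif header is not None:
            parts[header].append(line)  # sequence may wrap over several lines
    return {name: "".join(chunks) for name, chunks in parts.items()}


def is_aligned(records: Dict[str, str]) -> bool:
    """True if all sequences share one length, as expected of an alignment."""
    return len({len(seq) for seq in records.values()}) <= 1
```

With a file on disk, this could be used as `records = parse_fasta(open(path))`, where `path` points to one of the alignment files; `is_aligned(records)` then gives a quick sanity check that the file is an alignment and not a set of unaligned sequences.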

Created:

2025-07-21
