Distances and their visualization in studies of spatial-temporal genetic variation using single nucleotide polymorphisms (SNPs)
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.4b8gthtkn
Description:
Distance measures are widely used for examining genetic structure in datasets that comprise many individuals scored for a very large number of attributes. Genotype datasets composed of single nucleotide polymorphisms (SNPs) typically contain bi-allelic scores for tens of thousands if not hundreds of thousands of loci.
We examine the application of distance measures to SNP genotypes and sequence tag presence-absences (SilicoDArT) and use real datasets and simulated data to illustrate pitfalls in the application of genetic distances and their visualization.
The datasets used to illustrate points in the associated review are provided here, together with the R script used to analyse the data. Data are either simulated within this script or are SNP data generated as part of other studies and included as compressed binary files that can be read into R with the base R function readRDS(). Refer to the analysis script for examples.
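As a minimal sketch of the reading step, the snippet below writes a small stand-in object with saveRDS() and reads it back with readRDS(). The toy matrix and the file name toy_data.Rdata are hypothetical; for the supplied files one would pass the real file name instead. Those files presumably hold genotype objects from dartR/adegenet, so those packages would likely need to be installed and loaded for the objects to behave as intended (an assumption, not something stated in this record).

```r
# Hypothetical stand-in for one of the supplied files,
# e.g. "SNP_starting_data.Rdata" in the working directory.
toy <- matrix(c(0L, 1L, 2L, NA, 1L, 0L), nrow = 2,
              dimnames = list(c("ind1", "ind2"), c("loc1", "loc2", "loc3")))

# Serialize a single object to disk. The extension does not matter to
# saveRDS()/readRDS(); what matters is that the file holds one object.
path <- file.path(tempdir(), "toy_data.Rdata")
saveRDS(toy, path)

# readRDS() returns the object rather than restoring it under its old name,
# so the result has to be assigned.
snp <- readRDS(path)
dim(snp)
```

Note that load(), which restores named objects from a workspace file, is a different mechanism; the description above specifies readRDS() even though the files carry an .Rdata extension.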
Methods
A dataset was constructed from a SNP matrix generated for the freshwater turtles in the genus Emydura, a recent radiation of Chelidae in Australasia. The dataset (SNP_starting_data.Rdata) includes selected populations that vary in level of divergence, to encompass both variation within species and variation between closely related species. Sampling localities with evidence of admixture between species were removed. Monomorphic loci were removed, and the data were filtered on call rate (>95%), repeatability (>99.5%) and read depth (5x < read depth < 50x). Where there was more than one SNP per sequence tag, only one was retained, chosen at random. The resultant dataset had 18,196 SNP loci scored for 381 individuals from 7 sampling localities or populations: Emydura victoriae [Ord River, NT, n=15], E. tanybaraga [Holroyd River, Qld, n=10], E. subglobosa worrelli [Daly River, NT, n=25], E. subglobosa subglobosa [Fly River, PNG, n=55], E. macquarii macquarii [Murray Darling Basin north, NSW/Qld, n=152], E. macquarii krefftii [Fitzroy River, Qld, n=39] and E. macquarii emmotti [Cooper Creek, Qld, n=85]. The missing data rate was 1.7%; missing values were subsequently imputed by nearest neighbour to yield a fully populated data matrix. The data are a subset of those published by Georges et al. (2018, Molecular Ecology 27:5195-5213) and are provided for illustrative purposes only. A companion SilicoDArT dataset (silicodart_starting_data.Rdata) is also included.
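The call-rate and monomorphic-locus filters described above can be illustrated in base R on a toy genotype matrix. This is a sketch only, not the dartR workflow the authors used: the matrix, its dimensions and the missing-data rates are invented, genotypes are assumed to be coded 0/1/2 (count of the alternate allele), and the repeatability and read-depth filters are omitted because they rely on locus metadata that a bare matrix does not carry.

```r
set.seed(1)
n_ind <- 100
n_loc <- 200

# Toy genotypes: rows are individuals, columns are loci, values are 0/1/2.
freq <- runif(n_loc, 0.05, 0.5)
geno <- sapply(freq, function(p) rbinom(n_ind, 2, p))
geno[, 1:10] <- 0L                      # force the first ten loci to be monomorphic

# Scatter missing calls, with a different missing rate for each locus.
miss_rate <- runif(n_loc, 0, 0.15)
for (j in seq_len(n_loc)) {
  geno[runif(n_ind) < miss_rate[j], j] <- NA
}

# Call rate: proportion of individuals with a called genotype at each locus.
call_rate <- colMeans(!is.na(geno))

# A locus is polymorphic if the alternate allele frequency among called
# genotypes lies strictly between 0 and 1.
alt_freq <- colMeans(geno, na.rm = TRUE) / 2
polymorphic <- alt_freq > 0 & alt_freq < 1

keep <- call_rate > 0.95 & polymorphic
filtered <- geno[, keep, drop = FALSE]

dim(filtered)                # loci remaining after filtering
mean(is.na(filtered))        # residual missing-data rate
```

The residual missing-data rate printed at the end is the quantity reported as 1.7% for the Emydura dataset before imputation.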
The above manipulations were performed in the R package dartR. Principal Components Analysis was undertaken using the glPCA function of the R package adegenet (as implemented in dartR). Principal Coordinates Analysis was undertaken using the pcoa function of the R package ape (as implemented in dartR).
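To show how the two ordinations relate, the sketch below uses base R stand-ins, prcomp() for PCA and cmdscale() for PCoA, rather than glPCA or pcoa. The two-population toy data are invented. It illustrates a standard result: a PCoA of Euclidean distances recovers the same configuration as a PCA of the centred genotypes, up to the sign of each axis.

```r
set.seed(42)
n_per_pop <- 15
n_loc <- 300

# Two toy populations whose allele frequencies differ by random offsets.
p1 <- runif(n_loc, 0.1, 0.9)
p2 <- pmin(pmax(p1 + rnorm(n_loc, 0, 0.2), 0.05), 0.95)
geno <- rbind(
  t(replicate(n_per_pop, rbinom(n_loc, 2, p1))),
  t(replicate(n_per_pop, rbinom(n_loc, 2, p2)))
)

# PCA on centred, unscaled genotypes.
pca <- prcomp(geno, center = TRUE, scale. = FALSE)

# Classical PCoA (metric multidimensional scaling) on Euclidean distances.
pcoa <- cmdscale(dist(geno), k = 2, eig = TRUE)

# Axis-by-axis agreement; the sign of an axis is arbitrary, hence abs().
axis_agreement <- c(
  abs(cor(pca$x[, 1], pcoa$points[, 1])),
  abs(cor(pca$x[, 2], pcoa$points[, 2]))
)
axis_agreement
```

With a non-Euclidean genetic distance in place of dist(), the two ordinations would no longer be expected to coincide.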
To exemplify the effect of missing values on SNP visualisation using PCA, we simulated ten populations that reproduced over 200 non-overlapping generations. Simulated populations were placed in a linear series with low dispersal between adjacent populations (one disperser every ten generations). Each population had 100 individuals, of which 50 individuals were sampled at random. Genotypes were generated for 1,000 neutral loci on one chromosome. We then randomly selected 50% of genotypes and set them as missing data. Principal Components Analysis was undertaken using the glPCA function of the R adegenet package. The R script to implement this is provided (Supplementary_script_for_ms.R).
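The effect of missing values can be sketched in base R without the forward-time simulation. The snippet below makes several loud assumptions: allele frequencies follow a random walk along a chain of ten populations (a crude stand-in for drift with limited dispersal), prcomp() replaces glPCA, and each missing genotype is filled with its locus mean before the PCA. Mean replacement is assumed here to resemble how missing values are treated in practice; this should be checked against the glPCA documentation. All parameter values other than the population count, sample size, locus count and 50% missing rate are invented.

```r
set.seed(7)
n_pop <- 10
n_ind <- 50
n_loc <- 1000

# Allele frequencies drift along a linear chain of populations.
freq <- matrix(NA_real_, n_pop, n_loc)
freq[1, ] <- runif(n_loc, 0.2, 0.8)
for (k in 2:n_pop) {
  freq[k, ] <- pmin(pmax(freq[k - 1, ] + rnorm(n_loc, 0, 0.05), 0.01), 0.99)
}

# Sample 50 individuals per population; rows are individuals.
pop <- rep(seq_len(n_pop), each = n_ind)
geno <- t(sapply(pop, function(k) rbinom(n_loc, 2, freq[k, ])))

# Set a random 50% of genotypes to missing.
masked <- geno
masked[sample(length(masked), length(masked) / 2)] <- NA

# Fill each gap with the mean of the observed genotypes at that locus.
locus_mean <- colMeans(masked, na.rm = TRUE)
imputed <- masked
gaps <- which(is.na(imputed), arr.ind = TRUE)
imputed[gaps] <- locus_mean[gaps[, "col"]]

# Compare total variance and the spread along the first axis.
total_var <- function(x) sum(apply(x, 2, var))
pca_full <- prcomp(geno)
pca_imp <- prcomp(imputed)

c(full = total_var(geno), imputed = total_var(imputed))
c(full = pca_full$sdev[1], imputed = pca_imp$sdev[1])
```

Because a mean-filled genotype contributes no deviation at its locus, individuals are pulled toward the origin of the ordination. Plotting pca_full$x and pca_imp$x side by side, coloured by pop, makes the contraction visible.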
The data for the Australian Blue Mountains skink Eulamprus leuraensis were generated for 372 individuals collected from 17 swamps isolated to varying degrees in the Blue Mountains region of New South Wales. Tail snips were collected and stored in 95% ethanol. The tissue samples were digested with proteinase K overnight and DNA was extracted using a NucleoMag 96 Tissue Kit (Macherey-Nagel, Düren, Germany) coupled with NucleoMag SEP (Ref. 744900) to allow automated separation of high-quality DNA on a Freedom Evo robotic liquid handler (TECAN, Männedorf, Switzerland). SNP data were generated by the commercial service of Diversity Arrays Technology Pty Ltd (Canberra, Australia) using published protocols. A total of 13,496 loci were scored, which was reduced to 7,935 after filtering out secondary SNPs on the same sequence tag, filtering on reproducibility (threshold 0.99) and call rate (threshold 0.95), and removing monomorphic loci. The resultant data (Eulamprus_filtered.Rdata) are used to demonstrate the impact of a substantial inversion on the outcomes of a PCA.
To test the effect of closely related individuals (parents and offspring) on the PCoA pattern, we ran a simulation using dartR in which two individuals were selected as parents and assigned between 2 and 8 offspring. We ran a PCoA for each of the simulated cases. The R code used is included in the R script uploaded here.
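A base R sketch of the same idea follows. It is not the dartR simulation used by the authors: the population size, locus count and allele frequencies are invented, loci are assumed to be unlinked, and cmdscale() stands in for the PCoA routine. Two individuals are taken as parents, offspring are formed by Mendelian transmission, and distances within the family are compared with distances among unrelated individuals.

```r
set.seed(3)
n_ind <- 30
n_loc <- 500
n_off <- 8        # the source varies this between 2 and 8

# Unrelated individuals drawn from a single randomly mating population.
freq <- runif(n_loc, 0.1, 0.9)
geno <- t(replicate(n_ind, rbinom(n_loc, 2, freq)))

# A parent with genotype g (0/1/2) transmits the alternate allele
# with probability g / 2 at each locus.
gamete <- function(g) rbinom(length(g), 1, g / 2)

# Individuals 1 and 2 are the parents of all offspring.
offspring <- t(replicate(n_off, gamete(geno[1, ]) + gamete(geno[2, ])))
all_geno <- rbind(geno, offspring)

# Euclidean distances and a two-axis PCoA.
d <- as.matrix(dist(all_geno))
pcoa <- cmdscale(as.dist(d), k = 2)

# Mean pairwise distance within a set of individuals.
mean_pairwise <- function(idx) {
  m <- d[idx, idx]
  mean(m[upper.tri(m)])
}
siblings <- n_ind + seq_len(n_off)
unrelated <- 3:n_ind

c(siblings = mean_pairwise(siblings), unrelated = mean_pairwise(unrelated))
```

Plotting pcoa with the family members highlighted shows where the parents and offspring fall relative to the unrelated individuals; rerunning with n_off set from 2 to 8 mirrors the range of cases described above.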
Refer to the companion manuscript for links to the literature associated with the above techniques.
Created:
2024-01-15



