Names and abbreviations of 1KGP populations.

Source: Figshare. Updated 2026-03-16; indexed 2026-04-28.
Download link:
https://figshare.com/articles/dataset/_p_Names_and_abbreviations_of_1KGP_populations_p_/31759268
Description:
Biobanks now contain genetic data from millions of individuals. Dimensionality reduction, visualization and clustering are standard when exploring data at these scales; while efficient and tractable methods exist for the first two, clustering remains challenging because of the many ways in which demography and sampling can affect structure. In practice, clustering is commonly performed by drawing shapes around dimensionally reduced data or assuming populations have “type” genomes or allele frequencies that represent a population. We propose to use dimensionality reduction with UMAP followed by clustering with HDBSCAN to identify sets of points forming relatively dense subsets in genotype space. The approach is fast, easy to implement, and integrates with existing pipelines. When applied to simulated data or data from three biobanks, the approach identifies groups of individuals enriched for shared features correlated with ancestry, including country of birth, ethnicity, and sampling location, without requiring strong assumptions about the number or size of clusters, or the sources of population structure. Because it does not rely on proximity to a specific point in genetic space, this topological approach can form clusters that continuously span long distances in genetic space. This can help distinguish admixed populations, which can exhibit wide ancestry variation within populations and overlap of ancestry proportions across populations. Such clusters can highlight and account for interpretable sources of genetic, demographic, or sampling heterogeneity in a dataset that would otherwise have required a range of specialized techniques. We illustrate how topological genetic strata can further help us understand structure within biobanks, evaluate distributions of genotypic and phenotypic data, examine polygenic score transferability, identify potential influential alleles, and perform quality control.
Created:
2026-03-16