Data for Extremely sparse models of linkage disequilibrium in ancestrally diverse association studies
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/8157130
Description:
Data from "Extremely sparse models of linkage disequilibrium in ancestrally diverse association studies" (2023). This includes linkage disequilibrium graphical models (LDGMs) created from high-coverage 1000 Genomes Project sequencing data. The dataset consists of LDGM precision matrices, LDGM graphical models of SNPs, and lists of SNPs, all split into 1,361 approximately independent LD blocks across the genome. It additionally contains genotype information from chromosomes 21 and 22, inferred tree sequences of the high-coverage 1000 Genomes Project data, summary statistics for four traits in the UK Biobank, and UK Biobank correlation matrices from chromosomes 21 and 22. All genomic data is in the GRCh38 build.
The data can be cited as follows:
Pouria Salehi Nowbandegani, Anthony Wilder Wohns, Jenna L. Ballard, Eric S. Lander, Alex Bloemendal, Benjamin M. Neale, and Luke J. O’Connor. Extremely sparse models of linkage disequilibrium in ancestrally diverse association studies. Nat Genet. (2023) DOI: 10.1038/s41588-023-01487-8
The directory contains `.tar.gz` files, which can be extracted and unzipped with:
$ tar -xvf FILENAME.tar.gz
All LD block files are named by chromosome and start/end basepair coordinates.
1kg_nygc_trios_removed_All_pops_geno_ids_pops.csv: The file contains 5008 rows, 2 for each individual in the 1000 Genomes Project. Each row contains the individual's 1000 Genomes ID, together with the ancestry group and continental ancestry group that the individual was assigned to. Rows correspond to columns in `.genos` files.
AFR/AMR/EAS/EUR/SAS.precision.tar.gz: Precision matrices for the relevant ancestry group for each LD block. Edge lists contain one row for each non-zero entry of the precision matrix. There are no column names.
genos_chr21_22.tar.gz: For the 40 LD blocks on chromosomes 21-22, `.genos` files are 0/1 matrices with dimension number-of-SNPs by number-of-samples. Each LD matrix contains one column for each row in the SNP list files, and one row for each row in the sample ID files.
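As a hedged illustration of how a 0/1 matrix of this kind relates to LD, the sketch below computes the frequency of the 1 allele for a SNP and the correlation between two SNPs. It assumes one row per SNP and one column per haplotype, following the stated dimensions (transpose the input if a file is laid out the other way). The in-memory `genos` list and both function names are hypothetical, and the on-disk delimiter of the `.genos` files is not assumed here.

```python
from math import sqrt


def allele_frequency(row):
    """Fraction of haplotypes in a 0/1 row that carry the 1 allele."""
    return sum(row) / len(row)


def ld_correlation(row_a, row_b):
    """Pearson correlation r between two 0/1 haplotype rows."""
    n = len(row_a)
    if n != len(row_b):
        raise ValueError("rows must cover the same haplotypes")
    p_a = allele_frequency(row_a)
    p_b = allele_frequency(row_b)
    # Frequency of haplotypes carrying the 1 allele at both SNPs.
    p_ab = sum(a * b for a, b in zip(row_a, row_b)) / n
    denom = sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    if denom == 0:
        raise ValueError("monomorphic SNP: correlation is undefined")
    return (p_ab - p_a * p_b) / denom


# Toy matrix (made-up values): 4 SNPs (rows) by 6 haplotypes (columns).
genos = [
    [0, 1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0],
]

r_identical = ld_correlation(genos[0], genos[1])
r_opposite = ld_correlation(genos[0], genos[2])
r_unlinked = ld_correlation(genos[0], genos[3])
```

In this toy matrix the first two SNPs are carried by the same haplotypes, the third is carried by exactly the complementary haplotypes, and the fourth is statistically independent of the first.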
ldgms.tar.gz: 1361 LDGMs (*.edgelist files). Edge lists contain one row for each non-zero entry of the LDGM adjacency matrix. There is one LDGM edge list for each LD block. Each row represents an edge, as a tuple (index_1, index_2, entry). For the LDGM adjacency matrices, the entry is the edge weight, where 0 represents a strong dependency and e.g. 6 represents a weak dependency.
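The (index_1, index_2, entry) layout described above can be read with a few lines of standard-library Python. This is a minimal sketch, not the authors' loader: it assumes the edge lists are comma-separated with integer indices, and it assumes each undirected edge may be listed only once, so it mirrors every edge. The function names and the toy edge list are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_edgelist(handle, delimiter=","):
    """Yield (index_1, index_2, entry) tuples from a headerless edge list."""
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        i, j, value = row[:3]
        yield int(i), int(j), float(value)


def neighbors_within(edges, max_weight):
    """Map each index to the neighbors joined by an edge of weight <= max_weight.

    Smaller weights mean stronger dependencies, so a low threshold keeps
    only the strongest edges. Self-loops are ignored.
    """
    adjacency = defaultdict(set)
    for i, j, weight in edges:
        if i != j and weight <= max_weight:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return dict(adjacency)


# Toy edge list (made-up values); a real file would be opened with open(path).
example = io.StringIO("0,0,0\n0,1,0\n1,2,6\n2,3,2\n")
edges = list(read_edgelist(example))
strong = neighbors_within(edges, max_weight=2)
```

With the threshold set to 2, the weight-6 edge between indices 1 and 2 is dropped, leaving two separate pairs of strongly dependent indices.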
snplists_GRch38positions.tar.gz: 1361 *.snplist files, each of which contains information on the SNPs in each LD block. Each SNP list is an n x 11 table (n = number of SNPs), one for each LD block. The columns are:
index: these non-unique indices, starting at zero, correspond to rows and columns of the LDGMs. There can be multiple SNPs for a single index, which occurs when the corresponding mutations occur on the same brick of the bricked tree sequence. SNPs with the same index have high (nearly perfect) LD.
anc_alleles: ancestral allele
deriv_alleles: derived allele
EUR: allele frequency of derived allele in EUR samples
EAS: allele frequency of derived allele in EAS samples
AMR: allele frequency of derived allele in AMR samples
SAS: allele frequency of derived allele in SAS samples
AFR: allele frequency of derived allele in AFR samples
site_ids: unique identifier of each SNP, mostly as RSIDs
position: GRCh38 position of SNP
swap: indicates strandness swap
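Because `index` is not unique, a common first step is to group the SNPs that share a row and column of the LDGM. The sketch below does this with the standard library. It assumes each `*.snplist` file is comma-separated with a header row naming the eleven columns listed above; the function name and the toy rows, including the SNP identifiers, are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def group_snps_by_index(handle):
    """Map each LDGM index to the list of site_ids that share it."""
    groups = defaultdict(list)
    for record in csv.DictReader(handle):
        groups[int(record["index"])].append(record["site_ids"])
    return dict(groups)


# Toy SNP list (made-up values); a real file would be opened with open(path).
example = io.StringIO(
    "index,anc_alleles,deriv_alleles,EUR,EAS,AMR,SAS,AFR,site_ids,position,swap\n"
    "0,A,G,0.10,0.20,0.15,0.12,0.30,snp_a,1000,0\n"
    "0,C,T,0.10,0.20,0.15,0.12,0.30,snp_b,1050,0\n"
    "1,G,A,0.40,0.35,0.38,0.41,0.25,snp_c,2000,0\n"
)
groups = group_snps_by_index(example)

# Indices shared by more than one SNP, i.e. SNPs in nearly perfect LD.
shared = {index: ids for index, ids in groups.items() if len(ids) > 1}
```

Here `snp_a` and `snp_b` share index 0, so they map to the same row and column of the LDGM for that block.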
ukb.tar: Correlation matrices and SNP lists for SNPs in the UK Biobank.
correlation_matrices/: Correlation matrices for SNPs in the UK Biobank, computed by Weissbrod et al. 2020 Nat Genet; they can be downloaded by following the instructions here.
snplists/: Lists, in the *.snplist format, of the SNPs included in the UK Biobank.
tree_seqs.tar: Contains 22 tree sequences inferred by tsinfer from the 30x 1000 Genomes Project data. Tree sequences can be unzipped with tszip.
Summary statistics: there are four summary statistics files, obtained from https://alkesgroup.broadinstitute.org/UKBB/, and computed by Loh et al. 2018 Nat Genet.
| Phenotype | Heritability estimate | Effective sample size | Number of SNPs |
|---|---|---|---|
| Height | 0.570 | 650K | 12 Million |
| Body mass index | 0.303 | 500K | 12 Million |
| Cardiovascular disease | 0.155 | 450K | 12 Million |
| Type 2 diabetes | 0.073 | 450K | 12 Million |
Created:
2023-08-07



