gwascatalog/associations
Source: Hugging Face (updated 2026-03-30, indexed 2026-03-29)
Download link: https://hf-mirror.com/datasets/gwascatalog/associations
---
license: other
license_name: embl-ebi
license_link: https://www.ebi.ac.uk/about/terms-of-use/
language:
- en
tags:
- biology
- medical
---
# GWAS Catalog Associations
## Dataset Description
This dataset contains curated genetic association results from the [NHGRI-EBI GWAS Catalog](https://ebi.ac.uk/gwas), a manually curated resource of published genome-wide association studies (GWAS).
The dataset captures SNP–trait associations reported in peer-reviewed studies. Each row represents an association between a genetic variant (typically a single nucleotide polymorphism, SNP) and a disease or trait reported in a publication.
The GWAS Catalog aggregates information from thousands of GWAS publications and standardises metadata about studies, genomic loci, variants, genes, and statistical significance.
This Hugging Face dataset provides a tabular representation of the association records suitable for downstream analysis, machine learning, and genomics research workflows.
### Dataset Summary
* Task categories: genomics, biomedical data mining
* Data type: tabular
* Primary domain: genome-wide association studies (GWAS)
* Unit of observation: SNP–trait association
* Source: curated literature database
Typical uses include:
* genomic risk analysis
* variant annotation pipelines
* phenotype–genotype relationship studies
* machine learning on genetic associations
* meta-analysis of GWAS findings
---
# Dataset Structure
Each row corresponds to a reported association between a variant and a trait.
## Columns
| Column | Description |
|--------------------------------|------------------------------------------------------------------------------------------------------------------------------|
| DATE ADDED TO CATALOG | Date the study was added to the GWAS Catalog. |
| PUBMEDID | PubMed identifier for the publication reporting the association. |
| FIRST AUTHOR | Last name and initials of the first author of the publication. |
| DATE | Publication date (online/epub date if available). |
| JOURNAL | Abbreviated journal name in which the study appeared. |
| LINK | URL linking to the publication record in PubMed. |
| STUDY | Title of the publication reporting the GWAS. |
| DISEASE/TRAIT | Disease or trait investigated in the study. |
| INITIAL SAMPLE DESCRIPTION | Sample size and ancestry description for Stage 1 GWAS discovery cohort. |
| REPLICATION SAMPLE DESCRIPTION | Sample size and ancestry description for replication cohorts used to validate associations. |
| REGION | Cytogenetic region associated with the SNP. |
| CHR_ID | Chromosome number containing the SNP. |
| CHR_POS | Chromosomal coordinate of the SNP. |
| REPORTED GENE(S) | Gene(s) reported by the study authors as associated with the SNP. |
| MAPPED GENE(S) | Gene(s) mapped to the SNP based on genomic position. If intergenic, the nearest upstream and downstream genes are reported. |
| UPSTREAM_GENE_ID | Entrez Gene ID of the closest upstream gene if the SNP lies outside a gene. |
| DOWNSTREAM_GENE_ID | Entrez Gene ID of the closest downstream gene if the SNP lies outside a gene. |
| SNP_GENE_IDS | Entrez Gene ID(s) if the SNP is located within a gene. Multiple IDs indicate overlapping transcripts. |
| UPSTREAM_GENE_DISTANCE | Distance in base pairs from the SNP to the nearest upstream gene if intergenic. |
| DOWNSTREAM_GENE_DISTANCE | Distance in base pairs from the SNP to the nearest downstream gene if intergenic. |
| STRONGEST SNP-RISK ALLELE | SNP most strongly associated with the trait and its risk allele (or haplotype if applicable). |
| SNPS | Identifier of the strongest SNP; may include multiple rsIDs for haplotypes. |
| MERGED | Indicates whether the SNP record has been merged with another rsID (0 = no, 1 = yes). |
| SNP_ID_CURRENT | Current rsID identifier when the original SNP has been merged. |
| CONTEXT | Predicted functional context of the variant (e.g., intronic, intergenic) based on Ensembl annotations. |
| INTERGENIC | Indicator for whether the SNP lies in an intergenic region (0 = no, 1 = yes). |
| RISK ALLELE FREQUENCY | Frequency of the risk allele among control individuals (or the largest control group if multiple are available). |
| P-VALUE | Reported p-value for the SNP association. Values are rounded to one significant digit. |
| PVALUE_MLOG | Negative log10 transformation of the p-value. |
| P-VALUE (TEXT) | Additional context about the p-value (e.g., subgroup analyses such as sex or smoking status). |
| OR or BETA | Reported odds ratio (OR) or beta coefficient associated with the risk allele. |
| 95% CI (TEXT) | Reported 95% confidence interval for the effect estimate. |
| PLATFORM (SNPS PASSING QC) | Genotyping platform used for Stage 1 GWAS, including notes on imputation or pooled designs where applicable. |
| CNV | Indicates whether the study involves copy number variation analysis (yes/no). |
| MAPPED_TRAIT | Experimental Factor Ontology (EFO) trait mapped to this study. |
| MAPPED_TRAIT_URI | URI of the mapped EFO trait. |
| STUDY ACCESSION | Accession ID allocated to the study by the GWAS Catalog. |
| GENOTYPING TECHNOLOGY | Genotyping technology or technologies used in the study, with additional array information (e.g., Immunochip or Exome array) in brackets. |
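
As a rough illustration of how a few of these columns relate to one another, the sketch below works on a single made-up row (the values are hypothetical, not taken from the dataset). It assumes that `STRONGEST SNP-RISK ALLELE` has the form `rsID-allele` with `?` standing for an unreported allele, and that `SNP_ID_CURRENT` may be stored without the `rs` prefix. Both assumptions should be checked against the actual data, and haplotype entries with several rsIDs are not handled.

```python
import math

def parse_risk_allele(value):
    """Split 'rs111-A' into ('rs111', 'A'); the allele is None if unreported."""
    rsid, sep, allele = value.strip().rpartition("-")
    if not sep:  # no '-allele' suffix present
        return value.strip(), None
    return rsid, (None if allele == "?" else allele)

def current_rsid(row):
    """Prefer SNP_ID_CURRENT when the record is flagged as merged."""
    merged = str(row.get("MERGED", "0")).strip() in {"1", "1.0"}
    if merged and row.get("SNP_ID_CURRENT"):
        current = str(row["SNP_ID_CURRENT"]).strip()
        # Assumption: the current ID may lack the 'rs' prefix.
        return current if current.startswith("rs") else "rs" + current
    return row["SNPS"]

def minus_log10(p_value):
    """Recompute the PVALUE_MLOG quantity from a P-VALUE entry."""
    return -math.log10(float(p_value))

# Hypothetical row for illustration only.
row = {
    "STRONGEST SNP-RISK ALLELE": "rs111-A",
    "SNPS": "rs111",
    "MERGED": "1",
    "SNP_ID_CURRENT": "222",
    "P-VALUE": "3E-9",
}

print(parse_risk_allele(row["STRONGEST SNP-RISK ALLELE"]))
print(current_rsid(row))
print(round(minus_log10(row["P-VALUE"]), 2))
```

Keeping the parsing in small functions makes it easy to adjust once the real value formats have been inspected.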
---
# Curation Process
The GWAS Catalog is curated through a combination of automated and manual processes:
1. Literature identification
   * Publications describing genome-wide association studies are identified through literature searches and author submissions.
2. Manual curation
   * Expert curators review publications and extract key information, including:
     * variant identifiers (e.g., rsIDs)
     * associated traits or diseases
     * statistical significance metrics
     * effect sizes
     * sample descriptions
3. Standardisation
   * Extracted data are normalised using standardised vocabularies and identifiers where possible, including:
     * controlled trait terms, including ontology terms from the [Experimental Factor Ontology (EFO)](https://www.ebi.ac.uk/efo/)
     * genomic coordinates
     * gene identifiers
     * a [standardised ancestry label framework](https://link.springer.com/article/10.1186/s13059-018-1396-2)
4. Annotation
   * Variants are annotated with additional genomic information such as:
     * mapped genes
     * variant context (e.g., intronic, intergenic)
     * genomic distances to nearby genes
5. Quality control
   * Curated records undergo internal quality checks to ensure consistency, correct variant identifiers, and valid genomic annotations.
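
Because traits are standardised to ontology terms, the `MAPPED_TRAIT_URI` column can be reduced to compact term identifiers. The following is a minimal sketch, assuming that multiple URIs in one cell are separated by commas and that the term ID is the last path segment of each URI; the URIs in the usage line are placeholders for illustration.

```python
def ontology_ids(mapped_trait_uri):
    """Return the trailing term ID of each URI in a comma-separated string."""
    ids = []
    for uri in mapped_trait_uri.split(","):
        uri = uri.strip()
        if uri:
            # Take everything after the last '/' as the term ID.
            ids.append(uri.rstrip("/").rsplit("/", 1)[-1])
    return ids

# Placeholder URIs, for illustration only.
example = "http://www.ebi.ac.uk/efo/EFO_0000001, http://www.ebi.ac.uk/efo/EFO_0000002"
print(ontology_ids(example))
```

Taking the last path segment rather than matching an `EFO_` prefix keeps the helper usable if some traits map to terms imported from other ontologies.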
For more information about the curation process, please see [our documentation](https://www.ebi.ac.uk/gwas/docs/methods).
The Hugging Face dataset mirrors the [tabular association records](https://www.ebi.ac.uk/gwas/docs/file-downloads) published by the GWAS Catalog on 2026-03-17.
---
# Bias, Limitations, and Population Representation
Genome-wide association studies have several well-known limitations that may affect analyses using this dataset.
## Population Bias
A large proportion of GWAS have historically been conducted in individuals genetically similar to European reference populations. As a result:
* genetic associations may not generalise across populations
* allele frequencies may differ substantially between ancestries
* effect sizes may vary across populations
Users should exercise caution when applying results derived from GWAS to diverse populations.
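
One way to get a first impression of ancestry composition is to tally sample sizes from the `INITIAL SAMPLE DESCRIPTION` text. The sketch below assumes descriptions follow a pattern like "N <label> ancestry ..."; this is an assumption about the free-text format, the example string is made up, and descriptions that do not match the pattern are silently ignored.

```python
import re
from collections import Counter

# Matches e.g. "1,200 European ancestry" or "500 East Asian ancestry".
SAMPLE_PATTERN = re.compile(r"(\d[\d,]*)\s+([A-Za-z][A-Za-z \-/]*?)\s+ancestry")

def ancestry_counts(description):
    """Sum reported sample sizes per ancestry label in one description."""
    counts = Counter()
    for number, label in SAMPLE_PATTERN.findall(description):
        counts[label.strip()] += int(number.replace(",", ""))
    return counts

# Made-up description, for illustration only.
text = ("1,200 European ancestry cases, 3,400 European ancestry controls, "
        "500 East Asian ancestry individuals")
print(ancestry_counts(text))
```

Summing these counters over all studies gives a crude view of representation; a careful analysis would deduplicate by study first, since one study contributes many association rows.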
## Publication Bias
The catalog reflects published associations, which introduces potential bias:
* studies with statistically significant findings are more likely to be published
* null results are often underrepresented
* some loci may appear more frequently because they are studied more extensively
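
To see how unevenly variants are covered, one can count the number of distinct publications reporting each variant. This is a small sketch using the `SNPS` and `PUBMEDID` columns; the rows are hypothetical and stand in for records read from the dataset.

```python
from collections import defaultdict

def publications_per_snp(rows):
    """Count distinct PUBMEDID values for each entry in the SNPS column."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["SNPS"]].add(row["PUBMEDID"])
    return {snp: len(pmids) for snp, pmids in seen.items()}

# Hypothetical rows, for illustration only.
rows = [
    {"SNPS": "rs111", "PUBMEDID": "10000001"},
    {"SNPS": "rs111", "PUBMEDID": "10000002"},
    {"SNPS": "rs111", "PUBMEDID": "10000002"},  # same paper, second trait
    {"SNPS": "rs222", "PUBMEDID": "10000003"},
]
counts = publications_per_snp(rows)
print(counts)
```

A high count indicates a heavily studied variant, which is not the same as a stronger biological effect.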
## Study Heterogeneity
GWAS included in the catalog differ in:
* sample size
* cohort composition
* genotyping platform
* statistical methodology
* phenotype definitions
These differences can influence reported effect sizes and significance levels.
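
Since the `OR or BETA` column holds two different kinds of effect estimate, it can help to parse `95% CI (TEXT)` before comparing rows. The sketch below assumes the interval is written as `[low-high]`, optionally followed by unit text for beta coefficients; that format is an assumption to verify against the data, and treating trailing unit text as a sign of a beta is only a heuristic.

```python
import re

# "[1.08-1.15]" or "[0.012-0.033] unit decrease"
CI_PATTERN = re.compile(r"\[\s*(-?\d*\.?\d+)\s*-\s*(-?\d*\.?\d+)\s*\]\s*(.*)")

def parse_ci(text):
    """Return (low, high, unit_text) or None if the text has no interval."""
    match = CI_PATTERN.match(text.strip())
    if match is None:
        return None
    low, high, unit = match.groups()
    return float(low), float(high), unit.strip() or None

print(parse_ci("[1.08-1.15]"))
print(parse_ci("[0.012-0.033] unit decrease"))
```

Rows for which `parse_ci` returns `None` are best kept aside rather than coerced into a number.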
---
# Summary statistics
This dataset includes only GWAS-significant associations.
Full summary statistics, including variants that do not reach GWAS significance, are [available directly from the GWAS Catalog](https://www.ebi.ac.uk/gwas/docs/methods/summary-statistics).
Summary statistics files in the GWAS Catalog undergo extensive quality control steps to improve their reusability.
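
This card does not state the numeric threshold behind "GWAS-significant". If a specific cut-off is needed, rows can be filtered on `PVALUE_MLOG`. The sketch below uses the conventional genome-wide threshold of 5e-8, which is an assumption chosen for illustration rather than a value taken from this card; the rows are hypothetical.

```python
import math

GENOME_WIDE_P = 5e-8  # conventional threshold; an assumption, not from the card

def passes_threshold(row, threshold=GENOME_WIDE_P):
    """Compare on the -log10 scale so tiny p-values do not underflow to 0."""
    return float(row["PVALUE_MLOG"]) >= -math.log10(threshold)

# Hypothetical rows, for illustration only.
rows = [
    {"SNPS": "rs111", "PVALUE_MLOG": "8.52"},
    {"SNPS": "rs222", "PVALUE_MLOG": "6.0"},
]
kept = [row["SNPS"] for row in rows if passes_threshold(row)]
print(kept)
```

Working with `PVALUE_MLOG` instead of `P-VALUE` avoids problems with extremely small p-values, which cannot be represented as ordinary floating-point numbers.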
---
# Credits
This dataset is derived from the [NHGRI-EBI GWAS Catalog](https://ebi.ac.uk/gwas).
We would like to thank:
* Authors who submit their data to the catalog, including full summary statistics
* Authors of the original GWAS publications included in the catalog
* GWAS Catalog team members, past and present
* Research participants who contributed data to the underlying genetic studies
---
# Citation
If you use this dataset in research, please cite the GWAS Catalog publication:
Maria Cerezo, Elliot Sollis, Yue Ji, Elizabeth Lewis, Ala Abid, Karatuğ Ozan Bircan, Peggy Hall, James Hayhurst, Sajo John, Abayomi Mosaku, Santhi Ramachandran, Amy Foreman, Arwa Ibrahim, James McLaughlin, Zoë Pendlington, Ray Stefancsik, Samuel A Lambert, Aoife McMahon, Joannella Morales, Thomas Keane, Michael Inouye, Helen Parkinson, Laura W Harris, The NHGRI-EBI GWAS Catalog: standards for reusability, sustainability and diversity, Nucleic Acids Research, Volume 53, Issue D1, 6 January 2025, Pages D998–D1005, https://doi.org/10.1093/nar/gkae1070
```bibtex
@article{cerezo2025nhgri,
title={The NHGRI-EBI GWAS Catalog: standards for reusability, sustainability and diversity},
author={Cerezo, Maria and Sollis, Elliot and Ji, Yue and Lewis, Elizabeth and Abid, Ala and Bircan, Karatu{\u{g}} Ozan and Hall, Peggy and Hayhurst, James and John, Sajo and Mosaku, Abayomi and others},
journal={Nucleic acids research},
volume={53},
number={D1},
pages={D998--D1005},
year={2025},
publisher={Oxford University Press}
}
```
---
# License
The NHGRI-EBI GWAS Catalog and all its contents are available under the [general Terms of Use for EMBL-EBI Services](https://ebi.ac.uk/about/terms-of-use). Summary statistics are made available under CC0 unless otherwise stated. We advise consumers of data hosted by the GWAS Catalog to note the license terms of individual datasets, if applicable to their specific use case.
Provider: gwascatalog