TraitGym
Source: ModelScope community. Updated: 2025-06-30. Indexed: 2025-07-05.
Download link:
https://modelscope.cn/datasets/lgq12697/TraitGym
Description:
# 🧬 TraitGym
[Benchmarking DNA Sequence Models for Causal Regulatory Variant Prediction in Human Genetics](https://www.biorxiv.org/content/10.1101/2025.02.11.637758v1)
🏆 Leaderboard: https://huggingface.co/spaces/songlab/TraitGym-leaderboard
## ⚡️ Quick start
- Load a dataset
```python
from datasets import load_dataset
dataset = load_dataset("songlab/TraitGym", "mendelian_traits", split="test")
```
- Example notebook to run variant effect prediction with a genomic language model (gLM), runs in 5 min on Google Colab: [`TraitGym.ipynb`](https://colab.research.google.com/github/songlab-cal/TraitGym/blob/main/TraitGym.ipynb)
## 🤗 Resources (https://huggingface.co/datasets/songlab/TraitGym)
- Datasets: `{dataset}/test.parquet`
- Subsets: `{dataset}/subset/{subset}.parquet`
- Features: `{dataset}/features/{features}.parquet`
- Predictions: `{dataset}/preds/{subset}/{model}.parquet`
- Metrics: `{dataset}/{metric}/{subset}/{model}.csv`
`dataset` examples (`load_dataset` config name):
- `mendelian_traits_matched_9` (`mendelian_traits`)
- `complex_traits_matched_9` (`complex_traits`)
- `mendelian_traits_all` (`mendelian_traits_full`)
- `complex_traits_all` (`complex_traits_full`)
`subset` examples:
- `all` (default)
- `3_prime_UTR_variant`
- `disease`
- `BMI`
`features` examples:
- `GPN-MSA_LLR`
- `GPN-MSA_InnerProducts`
- `Borzoi_L2`
`model` examples:
- `GPN-MSA_LLR.minus.score`
- `GPN-MSA.LogisticRegression.chrom`
- `CADD+GPN-MSA+Borzoi.LogisticRegression.chrom`
`metric` examples:
- `AUPRC_by_chrom_weighted_average` (main metric)
- `AUPRC`
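The path templates above can be filled in programmatically. The following is a minimal sketch, not part of the TraitGym code base: the names `TEMPLATES`, `CONFIG_NAMES`, and `resource_path` are hypothetical helpers introduced here for illustration, and the templates and names are copied from the lists above.

```python
# Path templates copied from the resource list above. Paths are relative to the
# root of the Hugging Face dataset repo (or to results/dataset/ locally).
TEMPLATES = {
    "dataset": "{dataset}/test.parquet",
    "subset": "{dataset}/subset/{subset}.parquet",
    "features": "{dataset}/features/{features}.parquet",
    "preds": "{dataset}/preds/{subset}/{model}.parquet",
    "metric": "{dataset}/{metric}/{subset}/{model}.csv",
}

# Directory name -> `load_dataset` config name, as listed above.
CONFIG_NAMES = {
    "mendelian_traits_matched_9": "mendelian_traits",
    "complex_traits_matched_9": "complex_traits",
    "mendelian_traits_all": "mendelian_traits_full",
    "complex_traits_all": "complex_traits_full",
}


def resource_path(kind, **fields):
    """Fill in one template; raises KeyError for an unknown kind or a missing field."""
    return TEMPLATES[kind].format(**fields)


if __name__ == "__main__":
    print(resource_path(
        "metric",
        dataset="complex_traits_matched_9",
        metric="AUPRC_by_chrom_weighted_average",
        subset="all",
        model="GPN-MSA.LogisticRegression.chrom",
    ))
```

Prefixing the result with `results/dataset/` should give the local path used by the Snakemake workflow, assuming the repository was downloaded into that directory.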
## 💻 Code (https://github.com/songlab-cal/TraitGym)
- Tries to follow [recommended Snakemake structure](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html)
- GPN-Promoter code is in [the main GPN repo](https://github.com/songlab-cal/gpn)
### Installation
First, clone the repo and `cd` into it.
Second, install the dependencies:
```bash
conda env create -f workflow/envs/general.yaml
conda activate TraitGym
```
Optionally, download precomputed datasets and predictions (6.7 GB):
```bash
mkdir -p results/dataset
huggingface-cli download songlab/TraitGym --repo-type dataset --local-dir results/dataset/
```
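If the full download is more than you need, a partial download from Python may be an option. The sketch below is an assumption-laden example rather than a documented TraitGym workflow: `build_patterns` and `download_subset` are hypothetical helper names, and the choice of which files to fetch is only illustrative. It relies on `snapshot_download` from `huggingface_hub` and its `allow_patterns` filter.

```python
def build_patterns(dataset, kinds=("test.parquet", "features/*")):
    """Glob patterns selecting a few files of one dataset directory (illustrative choice)."""
    return [f"{dataset}/{kind}" for kind in kinds]


def download_subset(dataset, local_dir="results/dataset",
                    kinds=("test.parquet", "features/*")):
    """Download only the files matching the patterns for `dataset`."""
    # Imported lazily so that build_patterns can be used without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="songlab/TraitGym",
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=build_patterns(dataset, kinds),
    )


if __name__ == "__main__":
    # Shows the patterns without touching the network.
    print(build_patterns("complex_traits_matched_9"))
    # Uncomment to actually download:
    # download_subset("complex_traits_matched_9")
```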
### Running
To compute a specific result, specify its path:
```bash
snakemake --cores all <path>
```
Example paths (these are already computed):
```bash
# zero-shot LLR
results/dataset/complex_traits_matched_9/AUPRC_by_chrom_weighted_average/all/GPN-MSA_absLLR.plus.score.csv
# logistic regression/linear probing
results/dataset/complex_traits_matched_9/AUPRC_by_chrom_weighted_average/all/GPN-MSA.LogisticRegression.chrom.csv
```
We recommend the following:
```bash
# Snakemake sometimes gets confused about which files need to be rerun;
# --touch marks existing files as up to date so that none of them is rerun
snakemake --cores all <path> --touch
# print the execution plan without running anything
snakemake --cores all <path> --dry-run
```
To evaluate your own set of model features, place a dataframe of shape `(n_variants, n_features)` in `results/dataset/{dataset}/features/{features}.parquet`.
For zero-shot evaluation of column `{feature}` and sign `{sign}` (`plus` or `minus`), you would invoke:
```bash
snakemake --cores all results/dataset/{dataset}/{metric}/all/{features}.{sign}.{feature}.csv
```
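To make the role of `{sign}` concrete, here is a rough sketch of what a zero-shot evaluation amounts to. It is not the workflow's actual implementation. It assumes that `plus` uses the feature column as is and `minus` negates it, so that a larger score always means "more likely causal", and it computes AUPRC as average precision in plain Python. The function names and the toy labels and values are hypothetical; tie handling is simplified compared to library implementations.

```python
def signed_scores(values, sign):
    """Orient a feature column so that larger scores mean 'more likely causal'."""
    if sign not in ("plus", "minus"):
        raise ValueError("sign must be 'plus' or 'minus'")
    factor = 1.0 if sign == "plus" else -1.0
    return [factor * v for v in values]


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    # Sort indices by decreasing score; ties keep input order (a simplification).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("at least one positive label is required")
    return total / hits


if __name__ == "__main__":
    # Toy data: causal variants tend to have strongly negative LLR values.
    labels = [True, False, True, False]
    llr = [-5.0, -0.1, -2.0, -3.0]
    print(round(average_precision(labels, signed_scores(llr, "minus")), 3))  # 0.833
    print(round(average_precision(labels, signed_scores(llr, "plus")), 3))   # 0.5
```

With these toy values the `minus` orientation ranks the causal variants higher, which is why a model name such as `GPN-MSA_LLR.minus.score` pairs a log-likelihood-ratio feature with the `minus` sign.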
To train and evaluate a logistic regression model, you would invoke:
```bash
snakemake --cores all results/dataset/{dataset}/{metric}/all/{feature_set}.LogisticRegression.chrom.csv
```
where `{feature_set}` should first be defined in `feature_sets` in `config/config.yaml` (this allows combining features defined in different files).
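The `.chrom` suffix in the model name suggests that the classifier is trained and evaluated with chromosomes held out; this is an assumption based on the name, not something stated here. Under that assumption, the splitting step could look like the sketch below, where `chrom_splits` is a hypothetical helper and the chromosome labels are toy data. Training a logistic regression on each split is left out.

```python
from collections import defaultdict


def chrom_splits(chroms):
    """Yield (held_out_chrom, train_indices, test_indices), one split per chromosome."""
    by_chrom = defaultdict(list)
    for i, chrom in enumerate(chroms):
        by_chrom[chrom].append(i)
    for chrom, test_idx in by_chrom.items():
        # Train on every variant that is not on the held-out chromosome.
        train_idx = [i for i, other in enumerate(chroms) if other != chrom]
        yield chrom, train_idx, test_idx


if __name__ == "__main__":
    chroms = ["1", "1", "2", "3", "2"]  # one entry per variant (toy data)
    for held_out, train_idx, test_idx in chrom_splits(chroms):
        print(held_out, train_idx, test_idx)
```

Each variant appears in exactly one test fold, so predictions from all folds can be concatenated into one score per variant before computing a metric.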
## Citation
[Link to paper](https://www.biorxiv.org/content/10.1101/2025.02.11.637758v2)
```bibtex
@article{traitgym,
title={Benchmarking DNA Sequence Models for Causal Regulatory Variant Prediction in Human Genetics},
author={Benegas, Gonzalo and Eraslan, G{\"o}kcen and Song, Yun S},
journal={bioRxiv},
pages={2025--02},
year={2025},
publisher={Cold Spring Harbor Laboratory}
}
```
Provider: maas
Created: 2025-06-30