
TraitGym

ModelScope community · Updated 2025-06-30 · Indexed 2025-07-05
Download link:
https://modelscope.cn/datasets/lgq12697/TraitGym
Resource description:
# 🧬 TraitGym

[Benchmarking DNA Sequence Models for Causal Regulatory Variant Prediction in Human Genetics](https://www.biorxiv.org/content/10.1101/2025.02.11.637758v1)

🏆 Leaderboard: https://huggingface.co/spaces/songlab/TraitGym-leaderboard

## ⚡️ Quick start

- Load a dataset
    ```python
    from datasets import load_dataset

    dataset = load_dataset("songlab/TraitGym", "mendelian_traits", split="test")
    ```
- Example notebook to run variant effect prediction with a gLM, runs in 5 min on Google Colab: `TraitGym.ipynb` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/songlab-cal/TraitGym/blob/main/TraitGym.ipynb)

## 🤗 Resources (https://huggingface.co/datasets/songlab/TraitGym)

- Datasets: `{dataset}/test.parquet`
- Subsets: `{dataset}/subset/{subset}.parquet`
- Features: `{dataset}/features/{features}.parquet`
- Predictions: `{dataset}/preds/{subset}/{model}.parquet`
- Metrics: `{dataset}/{metric}/{subset}/{model}.csv`

`dataset` examples (`load_dataset` config name):
- `mendelian_traits_matched_9` (`mendelian_traits`)
- `complex_traits_matched_9` (`complex_traits`)
- `mendelian_traits_all` (`mendelian_traits_full`)
- `complex_traits_all` (`complex_traits_full`)

`subset` examples:
- `all` (default)
- `3_prime_UTR_variant`
- `disease`
- `BMI`

`features` examples:
- `GPN-MSA_LLR`
- `GPN-MSA_InnerProducts`
- `Borzoi_L2`

`model` examples:
- `GPN-MSA_LLR.minus.score`
- `GPN-MSA.LogisticRegression.chrom`
- `CADD+GPN-MSA+Borzoi.LogisticRegression.chrom`

`metric` examples:
- `AUPRC_by_chrom_weighted_average` (main metric)
- `AUPRC`

## 💻 Code (https://github.com/songlab-cal/TraitGym)

- Tries to follow [recommended Snakemake structure](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html)
- GPN-Promoter code is in [the main GPN repo](https://github.com/songlab-cal/gpn)

### Installation

First, clone the repo and `cd` into it.
Second, install the dependencies:

```bash
conda env create -f workflow/envs/general.yaml
conda activate TraitGym
```

Optionally, download precomputed datasets and predictions (6.7G):

```bash
mkdir -p results/dataset
huggingface-cli download songlab/TraitGym --repo-type dataset --local-dir results/dataset/
```

### Running

To compute a specific result, pass its path to Snakemake as the target:

```bash
snakemake --cores all {path}
```

Example paths (these are already computed):

```bash
# zero-shot LLR
results/dataset/complex_traits_matched_9/AUPRC_by_chrom_weighted_average/all/GPN-MSA_absLLR.plus.score.csv
# logistic regression/linear probing
results/dataset/complex_traits_matched_9/AUPRC_by_chrom_weighted_average/all/GPN-MSA.LogisticRegression.chrom.csv
```

We recommend the following:

```bash
# Snakemake sometimes gets confused about which files it needs to rerun;
# this forces it not to rerun any existing file
snakemake --cores all --touch
# to output an execution plan
snakemake --cores all --dry-run
```

To evaluate your own set of model features, place a dataframe of shape `n_variants,n_features` in `results/dataset/{dataset}/features/{features}.parquet`.

For zero-shot evaluation of column `{feature}` and sign `{sign}` (`plus` or `minus`), you would invoke:

```bash
snakemake --cores all results/dataset/{dataset}/{metric}/all/{features}.{sign}.{feature}.csv
```

To train and evaluate a logistic regression model, you would invoke:

```bash
snakemake --cores all results/dataset/{dataset}/{metric}/all/{feature_set}.LogisticRegression.chrom.csv
```

where `{feature_set}` should first be defined in `feature_sets` in `config/config.yaml` (this allows combining features defined in different files).
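The target paths above follow a fixed template, so they can be assembled programmatically instead of typed by hand. The sketch below is a minimal illustration using only the Python standard library; the helper names (`zero_shot_target`, `logistic_regression_target`, `snakemake_command`) are hypothetical and not part of the TraitGym code base. The `subset` argument is an assumption borrowed from the metrics path layout; the commands shown above only use `all`.

```python
from pathlib import PurePosixPath

# Root directory used throughout the commands above.
RESULTS_ROOT = PurePosixPath("results/dataset")


def zero_shot_target(dataset, metric, features, sign, feature, subset="all"):
    """Build the metrics CSV path for zero-shot evaluation of one feature column."""
    if sign not in ("plus", "minus"):
        raise ValueError("sign must be 'plus' or 'minus'")
    name = f"{features}.{sign}.{feature}.csv"
    return str(RESULTS_ROOT / dataset / metric / subset / name)


def logistic_regression_target(dataset, metric, feature_set, subset="all"):
    """Build the metrics CSV path for a logistic regression model on a feature set."""
    name = f"{feature_set}.LogisticRegression.chrom.csv"
    return str(RESULTS_ROOT / dataset / metric / subset / name)


def snakemake_command(target, dry_run=True):
    """Return the argument list for a Snakemake call; defaults to printing the plan only."""
    args = ["snakemake", "--cores", "all"]
    if dry_run:
        args.append("--dry-run")
    args.append(target)
    return args


target = zero_shot_target(
    dataset="complex_traits_matched_9",
    metric="AUPRC_by_chrom_weighted_average",
    features="GPN-MSA_absLLR",
    sign="plus",
    feature="score",
)
print(target)
print(" ".join(snakemake_command(target)))
```

With these arguments, the first printed line reproduces the zero-shot example path listed above. The command defaults to `--dry-run` so that the execution plan can be inspected before anything is computed.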
## Citation

[Link to paper](https://www.biorxiv.org/content/10.1101/2025.02.11.637758v2)

```bibtex
@article{traitgym,
  title={Benchmarking DNA Sequence Models for Causal Regulatory Variant Prediction in Human Genetics},
  author={Benegas, Gonzalo and Eraslan, G{\"o}kcen and Song, Yun S},
  journal={bioRxiv},
  pages={2025--02},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}
```

Provider: maas
Created: 2025-06-30