BEND
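BEND Dataset Overview
Dataset Description
The BEND (Benchmarking DNA Language Models on Biologically Meaningful Tasks) dataset is used to evaluate how well DNA language models perform on biologically meaningful tasks.
Data Format
The data for each task is stored as a bed file that holds the genomic coordinates, split membership, and label of every sample. Labels that are too complex are stored in an hdf5 file, which shares its index with the bed file.
Example bed file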
chromosome  start    end      split  label
chr1        1055037  1055849  train  1
chr3        1070026  1070436  valid  0
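As a minimal sketch of reading such a file, the snippet below parses the rows with Python's standard csv module and groups the samples by split. It assumes the file on disk is tab-delimited and starts with the header row shown in the example; the function name `read_bed` is illustrative and not part of BEND.

```python
import csv
import io
from collections import defaultdict


def read_bed(handle):
    """Parse a BEND-style bed file (assumed: tab-delimited, with a header row)."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            "chromosome": row["chromosome"],
            "start": int(row["start"]),
            "end": int(row["end"]),
            "split": row["split"],
            "label": row["label"],  # kept as a string; label types vary by task
        })
    return rows


# In-memory copy of the example rows, so the sketch runs without any file.
example = (
    "chromosome\tstart\tend\tsplit\tlabel\n"
    "chr1\t1055037\t1055849\ttrain\t1\n"
    "chr3\t1070026\t1070436\tvalid\t0\n"
)
rows = read_bed(io.StringIO(example))

# Group samples by their split column.
by_split = defaultdict(list)
for r in rows:
    by_split[r["split"]].append(r)

print({split: len(items) for split, items in by_split.items()})
print(rows[0]["end"] - rows[0]["start"])  # length of the first region
```

To read a real file, pass an open file handle instead of the `io.StringIO` object.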
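Data Download
All data can be downloaded with a script; see the data download instructions for details.
Data Preprocessing
Computing Embeddings
To train downstream models, it is recommended to precompute and save the embeddings. They are stored as WebDataset tar.gz files.
Embedding computation script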
python scripts/precompute_embeddings.py
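Embedders Overview
The embedders live in bend.embedders. Each embedder takes a checkpoint path and provides an embed() method, which accepts a list of sequences and returns a list of embeddings.
Embedder example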
from bend.embedders import NucleotideTransformerEmbedder

embedder = NucleotideTransformerEmbedder("InstaDeepAI/nucleotide-transformer-2.5b-multi-species")
embeddings = embedder.embed(["AGGATGCCGAGAGTATATGGGA", "CCCAACCGAGAGTATATGTTAT"])
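Model Evaluation
Supervised Tasks
Once the embeddings have been computed, the following script trains and evaluates a model on a downstream task: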
python scripts/train_on_task.py --config-name {task}
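To run several tasks in a row, the command above can be assembled programmatically. The sketch below only builds and prints the command lines. The task names are hypothetical placeholders; the real values are the names of the configuration files available for `--config-name`.

```python
import shlex

# Hypothetical task names, used only as placeholders for real config names.
tasks = ["my_task_a", "my_task_b"]


def training_command(task):
    """Return the argument list for one train_on_task.py run."""
    return ["python", "scripts/train_on_task.py", "--config-name", task]


commands = [training_command(t) for t in tasks]
for cmd in commands:
    # Print instead of executing; pass cmd to subprocess.run(cmd, check=True) to launch it.
    print(shlex.join(cmd))
```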
Unsupervised Tasks
For unsupervised variant effect prediction, embeddings do not need to be precomputed and stored; they are generated and evaluated directly:
python3 scripts/predict_variant_effects.py {variant_file_name}.bed {output_file_name}.csv {model_type} {path_to_checkpoint} {path_to_reference_genome_fasta} --embedding_idx {position_of_embedding}
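This page does not spell out how a variant is scored. One common zero-shot approach, assumed here purely for illustration, compares the embedding of the reference sequence with the embedding of the mutated sequence and uses their cosine distance as the effect score. The vectors below are made-up toy values, and `cosine_distance` is not a BEND function.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


# Toy embeddings standing in for the model output at the variant position.
ref_embedding = [1.0, 0.0, 2.0]
alt_same_direction = [2.0, 0.0, 4.0]   # same direction as the reference
alt_orthogonal = [0.0, 3.0, 0.0]       # orthogonal to the reference

score_same = cosine_distance(ref_embedding, alt_same_direction)
score_orth = cosine_distance(ref_embedding, alt_orthogonal)
print(score_same, score_orth)
```

Under this assumption, a score near 0 means the variant barely changes the embedding, while larger scores indicate a stronger predicted effect.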
Extending the Dataset
Adding a New Embedder
All embedders are defined in bend/utils/embedders.py and inherit from BaseEmbedder. Each one must implement the load_model and embed methods.
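The shape of such a subclass can be sketched as follows. The `BaseEmbedder` defined here is a local stand-in so that the example runs on its own; the real class in BEND may have a different constructor and method signatures. `OneHotEmbedder` is a hypothetical toy that needs no checkpoint.

```python
class BaseEmbedder:
    """Stand-in for BEND's BaseEmbedder; the real class may differ."""

    def __init__(self, *args, **kwargs):
        self.load_model(*args, **kwargs)

    def load_model(self, *args, **kwargs):
        raise NotImplementedError

    def embed(self, sequences):
        raise NotImplementedError


class OneHotEmbedder(BaseEmbedder):
    """Toy embedder: one 4-dimensional one-hot vector per nucleotide."""

    def load_model(self, alphabet="ACGT"):
        # No checkpoint is needed here; a real embedder would load model weights.
        self.index = {base: i for i, base in enumerate(alphabet)}

    def embed(self, sequences):
        embeddings = []
        for seq in sequences:
            vectors = []
            for base in seq.upper():
                vec = [0.0] * len(self.index)
                if base in self.index:  # unknown bases such as N stay all-zero
                    vec[self.index[base]] = 1.0
                vectors.append(vec)
            embeddings.append(vectors)
        return embeddings


embedder = OneHotEmbedder()
embeddings = embedder.embed(["ACGT", "GGN"])
print(len(embeddings), len(embeddings[0]), embeddings[1][2])
```

As with the embedders described on this page, `embed()` takes a list of sequences and returns one embedding per sequence, here a list of per-nucleotide vectors.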
Adding a New Task
The data for a new task must be stored in bed format, and a new configuration file must be added under ../conf/supervised_tasks.
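A minimal sketch of producing such a bed file is shown below. The samples are invented, and assigning splits by holding out whole chromosomes is an assumption made for illustration, not a rule stated here. The layout of the configuration file is not described on this page, so it is not sketched.

```python
import csv
import io

# Hypothetical samples: (chromosome, start, end, label).
samples = [
    ("chr1", 100, 612, 1),
    ("chr2", 5000, 5512, 0),
    ("chr3", 900, 1412, 1),
]

# Assumed split rule for illustration: hold out whole chromosomes.
split_by_chromosome = {"chr2": "valid", "chr3": "test"}


def write_bed(samples, handle):
    """Write samples as tab-separated rows with a chromosome/start/end/split/label header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["chromosome", "start", "end", "split", "label"])
    for chrom, start, end, label in samples:
        split = split_by_chromosome.get(chrom, "train")
        writer.writerow([chrom, start, end, split, label])


# Write to an in-memory buffer; use open("my_task.bed", "w", newline="") for a real file.
buffer = io.StringIO()
write_bed(samples, buffer)
bed_text = buffer.getvalue()
print(bed_text)
```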
Citation Guide
When using the dataset, make sure to cite the original data sources correctly.
Citation examples
- Gene finding: GENCODE
@article{frankish_gencode_2021, title = {{GENCODE} 2021}, volume = {49}, issn = {0305-1048}, url = {https://doi.org/10.1093/nar/gkaa1087}, doi = {10.1093/nar/gkaa1087}, number = {D1}, urldate = {2022-09-26}, journal = {Nucleic Acids Research}, author = {Frankish, Adam and Diekhans, Mark and Jungreis, Irwin and Lagarde, Julien and Loveland, Jane E and Mudge, Jonathan M and Sisu, Cristina and Wright, James C and Armstrong, Joel and Barnes, If and Berry, Andrew and Bignell, Alexandra and Boix, Carles and Carbonell Sala, Silvia and Cunningham, Fiona and Di Domenico, Tomás and Donaldson, Sarah and Fiddes, Ian T and García Girón, Carlos and Gonzalez, Jose Manuel and Grego, Tiago and Hardy, Matthew and Hourlier, Thibaut and Howe, Kevin L and Hunt, Toby and Izuogu, Osagie G and Johnson, Rory and Martin, Fergal J and Martínez, Laura and Mohanan, Shamika and Muir, Paul and Navarro, Fabio C P and Parker, Anne and Pei, Baikang and Pozo, Fernando and Riera, Ferriol Calvet and Ruffier, Magali and Schmitt, Bianca M and Stapleton, Eloise and Suner, Marie-Marthe and Sycheva, Irina and Uszczynska-Ratajczak, Barbara and Wolf, Maxim Y and Xu, Jinuri and Yang, Yucheng T and Yates, Andrew and Zerbino, Daniel and Zhang, Yan and Choudhary, Jyoti S and Gerstein, Mark and Guigó, Roderic and Hubbard, Tim J P and Kellis, Manolis and Paten, Benedict and Tress, Michael L and Flicek, Paul}, month = jan, year = {2021}, pages = {D916--D923}, }
- Chromatin accessibility, histone modification, CpG methylation: ENCODE
@article{noauthor_integrated_2012, title = {An {Integrated} {Encyclopedia} of {DNA} {Elements} in the {Human} {Genome}}, volume = {489}, issn = {0028-0836}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439153/}, doi = {10.1038/nature11247}, number = {7414}, urldate = {2023-05-23}, journal = {Nature}, month = sep, year = {2012}, pmid = {22955616}, pmcid = {PMC3439153}, pages = {57--74}, }
- Enhancer annotation: Fulco et al., Gasperini et al., Avsec et al.
@article{fulco_activity-by-contact_2019, title = {Activity-by-contact model of enhancer–promoter regulation from thousands of {CRISPR} perturbations}, volume = {51}, copyright = {2019 The Author(s), under exclusive licence to Springer Nature America, Inc.}, issn = {1546-1718}, url = {https://www.nature.com/articles/s41588-019-0538-0}, doi = {10.1038/s41588-019-0538-0}, language = {en}, number = {12}, urldate = {2023-05-23}, journal = {Nature Genetics}, author = {Fulco, Charles P. and Nasser, Joseph and Jones, Thouis R. and Munson, Glen and Bergman, Drew T. and Subramanian, Vidya and Grossman, Sharon R. and Anyoha, Rockwell and Doughty, Benjamin R. and Patwardhan, Tejal A. and Nguyen, Tung H. and Kane, Michael and Perez, Elizabeth M. and Durand, Neva C. and Lareau, Caleb A. and Stamenova, Elena K. and Aiden, Erez Lieberman and Lander, Eric S. and Engreitz, Jesse M.}, month = dec, year = {2019}, note = {Number: 12 Publisher: Nature Publishing Group}, keywords = {Epigenetics, Epigenomics, Functional genomics, Gene expression, Gene regulation}, pages = {1664--1669}, }
- Noncoding variant effects (expression): DeepSEA
@article{zhou_predicting_2015, title = {Predicting effects of noncoding variants with deep learning–based sequence model}, url = {https://www.nature.com/articles/nmeth.3547}, doi = {10.1038/nmeth.3547}, language = {en}, number = {10}, urldate = {2023-06-07}, journal = {Nature Methods}, author = {Zhou, Jian and Troyanskaya, Olga G}, year = {2015}, }
- Noncoding variant effects (disease): ClinVar
@article{10.1093/nar/gkz972, author = {Landrum, Melissa J and Chitipiralla, Shanmuga and Brown, Garth R and Chen, Chao and Gu, Baoshan and Hart, Jennifer and Hoffman, Douglas and Jang, Wonhee and Kaur, Kuljeet and Liu, Chunlei and Lyoshin, Vitaly and Maddipatla, Zenith and Maiti, Rama and Mitchell, Joseph and O’Leary, Nuala and Riley, George R and Shi, Wenyao and Zhou, George and Schneider, Valerie and Maglott, Donna and Holmes, J Bradley and Kattman, Brandi L}, title = "{ClinVar: improvements to accessing data}", journal = {Nucleic Acids Research}, volume = {48}, number = {D1}, pages = {D835-D844}, year = {2019}, month = {11}, issn = {0305-1048}, doi = {10.1093/nar/gkz972}, url = {https://doi.org/10.1093/nar/gkz972}, eprint = {https://academic.oup.com/nar/article-pdf/48/D1/D835/31698033/gkz972.pdf}, }

- BEND: Benchmarking DNA Language Models on biologically meaningful tasks, 2024



