
Sequence and Fitness Datasets for Variant Fitness Prediction using Protein Language Models

Source: NIAID Data Ecosystem, indexed 2026-05-01
Download link:
https://zenodo.org/record/6784458
Description:
This dataset bundle contains three sets: 1) domain sequences for pretraining, 2) domain sequences for finetuning, and 3) variant fitness scores. Files are in LMDB format.

1. Domain sequences for pretraining

Two bz2-compressed files are provided:

rp15_seq_lmdb.tar.bz2: representative proteome sequences at the 15% level from the Pfam-V32 database. The whole dataset is randomly split into train and validation sets: 12,681,738 sequences in the training set and 1,042,103 sequences in the validation set. Sequence lengths range from 18 to 500 (inclusive), and this length-filtered set covers more than 95% of the sequences in the whole set.

rp75_seq_lmdb.tar.bz2: representative proteome sequences at the 75% level from the Pfam-V32 database. The whole dataset is randomly split into train and validation sets: 68,810,960 sequences in the training set and 5,687,282 sequences in the validation set. Sequence lengths range from 18 to 500 (inclusive), and this length-filtered set covers more than 95% of the sequences in the whole set.

Information for each sequence is stored as key-value pairs:

{
  'primary': protein amino acid sequence,
  'protein_length': length of the sequence,
  'family': sequence Pfam family id (without 'PF'),
  'clan': sequence Pfam clan id (without 'CL', -1 if none exists),
  'unpIden': sequence Uniprot_id.version_number,
  'range': domain residue start-end indices (following the indices of the Uniprot sequence),
  'id': an index number for each sequence from 0 to N
}

One example:

{'primary': 'ALQTTDKHHVATPANWRPGDDVIVPPPATQEAAEERLREG', 'protein_length': 40, 'family': 10417, 'clan': -1, 'unpIden': 'A0A147JSN0.1', 'range': '162-201', 'id': '0'}

2. Domain sequences for finetuning

We collected homologous sequences of 33 proteins from [Shin2021]. The sequences are domain sequences queried over the UniRef100 database.
Each family is split into train and validation sets with a 9:1 ratio.

Information for each sequence is stored as key-value pairs:

{
  'unp_range': Uniprot record name/start index - end index (indices follow the Uniprot sequence),
  'primary': protein amino acid sequence,
  'seq_reweight': sequence weighting score from Shin2021,
  'family_reweight': family weighting score from Shin2021 (sum of the seq_reweight scores of all family sequences),
  'seq_reweight_mmseqs2': sequence weighting score calculated by us using mmseqs2,
  'family_reweight_mmseqs2': family weighting score based on seq_reweight_mmseqs2 (sum of the seq_reweight_mmseqs2 scores of all family sequences)
}

One example:

{ 'unp_range': 'AMIE_PSEAE/1-346', 'primary': 'MRHGDISSSNDTVGVAVVNYKMPRLHTAAEVLDNARKIAEMIVGMKQGLPGMDLVVFPEYSLQGIMYDPAEMMETAVAIPGEETEIFSRACRKANVWGVFSLTGERHEEHPRKAPYNTLVLIDNNGEIVQKYRKIIPWCPIEGWYPGGQTYVSEGPKGMKISLIICDDGNYPEIWRDCAMKGAELIVRCQGYMYPAKDQQVMMAKAMAWANNCYVAVANAAGFDGVYSYFGHSAIIGFDGRTLGECGEEEMGIQYAQLSLSQIRDARANDQSQNHLFKILHRGYSGLQASGDGDRGLAECPFEFYRTWVTDAEKARENVERLTRSTTGVAQCPVGRLPYEGLEKEA', 'seq_reweight': 0.0714285714286, 'family_reweight': 19553.99941694187, 'seq_reweight_mmseqs2': 0.0021413276231263384, 'family_reweight_mmseqs2': 25236.560885598774 }

3. Variant fitness scores

This fitness benchmark set contains 42 mutagenesis sets, originally curated by [DeepSequence]; [Shin2021] later used a subset of them.
Information for each variant is stored as key-value pairs:

{
  'set_nm': set name,
  'wt_seq': WT sequence,
  'seq_len': sequence length,
  'mutants': amino acid variants list (may contain multi-site mutations),
  'mut_relative_idxs': list of relative amino acid indices for the variants,
  'mut_seq': mutant sequence,
  'fitness': fitness score
}

One example:

{ 'set_nm': 'AMIE_PSEAE_Whitehead', 'wt_seq': 'MRHGDISSSNDTVGVAVVNYKMPRLHTAAEVLDNARKIAEMIVGMKQGLPGMDLVVFPEYSLQGIMYDPAEMMETAVAIPGEETEIFSRACRKANVWGVFSLTGERHEEHPRKAPYNTLVLIDNNGEIVQKYRKIIPWCPIEGWYPGGQTYVSEGPKGMKISLIICDDGNYPEIWRDCAMKGAELIVRCQGYMYPAKDQQVMMAKAMAWANNCYVAVANAAGFDGVYSYFGHSAIIGFDGRTLGECGEEEMGIQYAQLSLSQIRDARANDQSQNHLFKILHRGYSGLQASGDGDRGLAECPFEFYRTWVTDAEKARENVERLTRSTTGVAQCPVGRLPYEG', 'seq_len': 341, 'mutants': ['M1W'], 'mut_relative_idxs': [0], 'mut_seq': 'WRHGDISSSNDTVGVAVVNYKMPRLHTAAEVLDNARKIAEMIVGMKQGLPGMDLVVFPEYSLQGIMYDPAEMMETAVAIPGEETEIFSRACRKANVWGVFSLTGERHEEHPRKAPYNTLVLIDNNGEIVQKYRKIIPWCPIEGWYPGGQTYVSEGPKGMKISLIICDDGNYPEIWRDCAMKGAELIVRCQGYMYPAKDQQVMMAKAMAWANNCYVAVANAAGFDGVYSYFGHSAIIGFDGRTLGECGEEEMGIQYAQLSLSQIRDARANDQSQNHLFKILHRGYSGLQASGDGDRGLAECPFEFYRTWVTDAEKARENVERLTRSTTGVAQCPVGRLPYEG', 'fitness': -0.5174 }

References

DeepSequence: Riesselman, Adam J., John B. Ingraham, and Debora S. Marks. "Deep generative models of genetic variation capture the effects of mutations." Nature Methods 15.10 (2018): 816-822.
Shin2021: Shin, Jung-Eun, et al. "Protein design and variant prediction using autoregressive generative models." Nature Communications 12.1 (2021): 1-11.
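
The relationship between the 'wt_seq', 'mutants', 'mut_relative_idxs' and 'mut_seq' fields of a variant record can be illustrated with a minimal Python sketch. It rests on assumptions inferred from the single example above, not on anything the dataset authors state: each mutant string is taken to have the form wild-type residue, position, mutant residue (as in 'M1W'), and each entry of 'mut_relative_idxs' is taken to be a 0-based offset into 'wt_seq'. The helper name apply_mutants is hypothetical and is not part of any released code.

```python
import re

# Assumed mutant string format: wild-type residue, position, mutant residue (e.g. 'M1W').
MUTANT_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutants(wt_seq, mutants, mut_relative_idxs):
    """Rebuild a mutant sequence from a wild-type sequence (hypothetical helper)."""
    if len(mutants) != len(mut_relative_idxs):
        raise ValueError("mutants and mut_relative_idxs must have the same length")
    residues = list(wt_seq)
    for mutant, idx in zip(mutants, mut_relative_idxs):
        match = MUTANT_RE.match(mutant)
        if match is None:
            raise ValueError(f"unrecognized mutant string: {mutant!r}")
        wt_aa, _position, mut_aa = match.groups()
        # Assumption: idx is a 0-based offset into wt_seq, as in the example record.
        if wt_seq[idx] != wt_aa:
            raise ValueError(f"{mutant!r} does not match residue {wt_seq[idx]!r} at index {idx}")
        residues[idx] = mut_aa
    return "".join(residues)


# Toy record: the first ten residues of the example wild-type sequence.
record = {"wt_seq": "MRHGDISSSN", "mutants": ["M1W"], "mut_relative_idxs": [0]}
mut_seq = apply_mutants(record["wt_seq"], record["mutants"], record["mut_relative_idxs"])
print(mut_seq)  # WRHGDISSSN
```

The sketch takes the substitution site from 'mut_relative_idxs' and ignores the number embedded in the mutant string, since that number may follow a different numbering scheme than the stored sequence. Comparing the rebuilt sequence with the stored 'mut_seq' is one way to test these assumptions on real records.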

Created:
2024-01-11