Sequence and Fitness Datasets for Variant Fitness Prediction using Protein Language Models
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/6784458
Description:
This dataset bundle contains three sets: 1) domain sequences for pretraining, 2) domain sequences for finetuning, and 3) variant fitness scores. All files are stored in LMDB format.
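The description does not say how keys and values are encoded inside the LMDB files. The sketch below assumes that each value is a pickled Python object, as in TAPE-style protein datasets; that encoding, the helper names `decode_record` and `iter_records`, and the `subdir` default are all assumptions to check against the extracted files. It relies on the third-party `lmdb` package.

```python
import pickle
from typing import Any, Iterator, Tuple


def decode_record(raw: bytes) -> Any:
    """Decode one LMDB value, ASSUMING it is a pickled Python object.

    Only unpickle files that come from a source you trust.
    """
    return pickle.loads(raw)


def iter_records(path: str, subdir: bool = True) -> Iterator[Tuple[bytes, Any]]:
    """Yield (key, decoded value) pairs from a read-only LMDB environment.

    `path` is a hypothetical location of an extracted LMDB; pass subdir=False
    if it is a single file rather than a directory holding data.mdb.
    """
    import lmdb  # third-party dependency: pip install lmdb

    env = lmdb.open(path, subdir=subdir, readonly=True, lock=False)
    try:
        with env.begin() as txn:
            for key, raw in txn.cursor():
                yield key, decode_record(raw)
    finally:
        env.close()
```

Not every key necessarily maps to a sequence record (a stored example count is one possibility), so callers may want to skip values that are not dicts.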
1. Domain sequences for pretraining
Two bz2-compressed files are provided:
rp15_seq_lmdb.tar.bz2: representative proteome sequences at the 15% level from the Pfam-V32 database. The whole dataset is randomly split into training and validation sets: 12,681,738 sequences in the training set and 1,042,103 in the validation set. Sequence lengths range from 18 to 500 (inclusive); this length-filtered set covers more than 95% of the sequences in the whole set.
rp75_seq_lmdb.tar.bz2: representative proteome sequences at the 75% level from the Pfam-V32 database. The whole dataset is randomly split into training and validation sets: 68,810,960 sequences in the training set and 5,687,282 in the validation set. Sequence lengths range from 18 to 500 (inclusive); this length-filtered set covers more than 95% of the sequences in the whole set.
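The length filter and split sizes quoted above can be checked with a few lines of arithmetic. This is only a sketch: `MIN_LEN`, `MAX_LEN`, `keep_sequence`, and `validation_fraction` are illustrative names, not part of the dataset.

```python
MIN_LEN, MAX_LEN = 18, 500  # inclusive bounds stated in the description

# (training, validation) sequence counts quoted in the description
SPLITS = {
    "rp15": (12_681_738, 1_042_103),
    "rp75": (68_810_960, 5_687_282),
}


def keep_sequence(seq: str) -> bool:
    """Return True if the sequence length lies within the inclusive bounds."""
    return MIN_LEN <= len(seq) <= MAX_LEN


def validation_fraction(n_train: int, n_valid: int) -> float:
    """Fraction of all sequences that fall in the validation split."""
    return n_valid / (n_train + n_valid)


for name, (n_train, n_valid) in SPLITS.items():
    frac = validation_fraction(n_train, n_valid)
    print(f"{name}: {frac:.2%} of sequences held out for validation")
```

Both files hold out roughly 7.6% of their sequences for validation.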
Each sequence is stored as a set of key-value pairs:
{
'primary': protein amino acid sequence,
'protein_length': length of the sequence,
'family': sequence Pfam family id (without 'PF'),
'clan': sequence Pfam clan id (without 'CL', -1 if not exists),
'unpIden': sequence Uniprot_id.version_number,
'range': domain residue start-end indices (following the indices of the Uniprot sequence),
'id': an index number for each sequence, from 0 to N
}
One example:
{'primary': 'ALQTTDKHHVATPANWRPGDDVIVPPPATQEAAEERLREG',
'protein_length': 40,
'family': 10417,
'clan': -1,
'unpIden': 'A0A147JSN0.1',
'range': '162-201',
'id': '0'}
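The fields of a record are redundant enough to be cross-checked. The sketch below parses `range`, confirms that it agrees with `protein_length` and `primary`, and rebuilds the prefixed Pfam accessions. The helper names are hypothetical, and the zero-padding widths (five digits for families, four for clans) follow the usual Pfam accession format rather than anything stated in this description.

```python
from typing import Optional, Tuple

record = {
    'primary': 'ALQTTDKHHVATPANWRPGDDVIVPPPATQEAAEERLREG',
    'protein_length': 40,
    'family': 10417,
    'clan': -1,
    'unpIden': 'A0A147JSN0.1',
    'range': '162-201',
    'id': '0',
}


def parse_range(rng: str) -> Tuple[int, int]:
    """Split a 'start-end' string into two integers (inclusive indices)."""
    start, end = rng.split('-')
    return int(start), int(end)


def is_consistent(rec: dict) -> bool:
    """Check that the range span, protein_length and len(primary) all agree."""
    start, end = parse_range(rec['range'])
    span = end - start + 1
    return span == rec['protein_length'] == len(rec['primary'])


def pfam_accession(family: int) -> str:
    """Rebuild e.g. 'PF10417' (ASSUMES five-digit zero padding)."""
    return f"PF{int(family):05d}"


def clan_accession(clan: int) -> Optional[str]:
    """Rebuild e.g. 'CL0023' (ASSUMES four-digit padding); None when clan is -1."""
    return None if int(clan) == -1 else f"CL{int(clan):04d}"


print(is_consistent(record), pfam_accession(record['family']), clan_accession(record['clan']))
```

For the example record, the span 162-201 covers 40 residues, which matches both `protein_length` and the length of `primary`.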
2. Domain sequences for finetuning
We collected homologous sequences for 33 proteins from [Shin2021]. These are domain sequences retrieved by querying the UniRef100 database. Each family is split into training and validation sets at a 9:1 ratio.
Each sequence is stored as a set of key-value pairs:
{
'unp_range': Uniprot record name/start index - end index (indices follow Uniprot seq),
'primary': protein amino acid sequence,
'seq_reweight': sequence weighting score from Shin2021,
'family_reweight': family weighting score from Shin2021 (sum of seq_reweight score for all family sequences),
'seq_reweight_mmseqs2': sequence weighting score calculated by us using mmseqs2,
'family_reweight_mmseqs2': family weighting score based on seq_reweight_mmseqs2 (sum of seq_reweight_mmseqs2 score for all family sequences)
}
One example:
{
'unp_range': 'AMIE_PSEAE/1-346',
'primary': 'MRHGDISSSNDTVGVAVVNYKMPRLHTAAEVLDNARKIAEMIVGMKQGLPGMDLVVFPEYSLQGIMYDPAEMMETAVAIPGEETEIFSRACRKANVWGVFSLTGERHEEHPRKAPYNTLVLIDNNGEIVQKYRKIIPWCPIEGWYPGGQTYVSEGPKGMKISLIICDDGNYPEIWRDCAMKGAELIVRCQGYMYPAKDQQVMMAKAMAWANNCYVAVANAAGFDGVYSYFGHSAIIGFDGRTLGECGEEEMGIQYAQLSLSQIRDARANDQSQNHLFKILHRGYSGLQASGDGDRGLAECPFEFYRTWVTDAEKARENVERLTRSTTGVAQCPVGRLPYEGLEKEA',
'seq_reweight': 0.0714285714286,
'family_reweight': 19553.99941694187,
'seq_reweight_mmseqs2': 0.0021413276231263384,
'family_reweight_mmseqs2': 25236.560885598774
}
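As the schema states, a family weight is the sum of the per-sequence weights in that family. The sketch below parses `unp_range` and recomputes that sum on made-up records. The helper names and toy values are illustrative only, and the normalization step shows one possible use of the weights (sampling sequences in proportion to them), not a procedure described here.

```python
import math
from typing import Dict, List, Tuple


def parse_unp_range(unp_range: str) -> Tuple[str, int, int]:
    """Split 'NAME/start-end' into (name, start, end)."""
    name, span = unp_range.rsplit('/', 1)
    start, end = span.split('-')
    return name, int(start), int(end)


def family_weight(records: List[Dict], key: str = 'seq_reweight') -> float:
    """Sum the per-sequence weights of all records in one family."""
    return math.fsum(rec[key] for rec in records)


def sampling_probabilities(records: List[Dict], key: str = 'seq_reweight') -> List[float]:
    """Normalize per-sequence weights so that they sum to 1."""
    total = family_weight(records, key)
    return [rec[key] / total for rec in records]


# Toy family with made-up weights (NOT values from the dataset)
toy_family = [
    {'unp_range': 'TOY_A/1-10', 'seq_reweight': 0.5},
    {'unp_range': 'TOY_B/3-12', 'seq_reweight': 0.25},
    {'unp_range': 'TOY_C/1-10', 'seq_reweight': 0.25},
]

print(parse_unp_range('AMIE_PSEAE/1-346'))
print(family_weight(toy_family))
print(sampling_probabilities(toy_family))
```

Passing `key='seq_reweight_mmseqs2'` applies the same computation to the mmseqs2-based weights.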
3. Variant fitness scores
This fitness benchmark contains 42 mutagenesis sets originally curated by [DeepSequence]; [Shin2021] later used a subset of them.
Each variant is stored as a set of key-value pairs:
{
'set_nm': set name,
'wt_seq': WT sequence,
'seq_len': sequence length,
'mutants': list of amino acid variants (may include multi-site mutations),
'mut_relative_idxs': list of relative amino acid indices for variants,
'mut_seq': mutant sequence,
'fitness': fitness score
}
One example:
{
'set_nm': 'AMIE_PSEAE_Whitehead',
'wt_seq': 'MRHGDISSSNDTVGVAVVNYKMPRLHTAAEVLDNARKIAEMIVGMKQGLPGMDLVVFPEYSLQGIMYDPAEMMETAVAIPGEETEIFSRACRKANVWGVFSLTGERHEEHPRKAPYNTLVLIDNNGEIVQKYRKIIPWCPIEGWYPGGQTYVSEGPKGMKISLIICDDGNYPEIWRDCAMKGAELIVRCQGYMYPAKDQQVMMAKAMAWANNCYVAVANAAGFDGVYSYFGHSAIIGFDGRTLGECGEEEMGIQYAQLSLSQIRDARANDQSQNHLFKILHRGYSGLQASGDGDRGLAECPFEFYRTWVTDAEKARENVERLTRSTTGVAQCPVGRLPYEG',
'seq_len': 341,
'mutants': ['M1W'],
'mut_relative_idxs': [0],
'mut_seq': 'WRHGDISSSNDTVGVAVVNYKMPRLHTAAEVLDNARKIAEMIVGMKQGLPGMDLVVFPEYSLQGIMYDPAEMMETAVAIPGEETEIFSRACRKANVWGVFSLTGERHEEHPRKAPYNTLVLIDNNGEIVQKYRKIIPWCPIEGWYPGGQTYVSEGPKGMKISLIICDDGNYPEIWRDCAMKGAELIVRCQGYMYPAKDQQVMMAKAMAWANNCYVAVANAAGFDGVYSYFGHSAIIGFDGRTLGECGEEEMGIQYAQLSLSQIRDARANDQSQNHLFKILHRGYSGLQASGDGDRGLAECPFEFYRTWVTDAEKARENVERLTRSTTGVAQCPVGRLPYEG',
'fitness': -0.5174
}
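In the example, the variant `M1W` replaces the residue `M` at `wt_seq[0]` with `W`, which is the position that `mut_relative_idxs` records. The sketch below rebuilds `mut_seq` from the other fields and checks the wild-type residue along the way. It assumes substitution strings of the form `<wt><position><mut>` with single-letter residues, and that `mut_relative_idxs` are 0-based offsets into `wt_seq`, as the example suggests. `apply_mutants` is a hypothetical helper and the short sequence is a toy.

```python
import re
from typing import List

# One substitution: wild-type residue, position number, mutant residue
MUTANT_RE = re.compile(r'^([A-Z])(\d+)([A-Z])$')


def apply_mutants(wt_seq: str, mutants: List[str], rel_idxs: List[int]) -> str:
    """Apply substitutions to wt_seq using 0-based relative indices."""
    if len(mutants) != len(rel_idxs):
        raise ValueError("mutants and rel_idxs must have the same length")
    seq = list(wt_seq)
    for mutant, idx in zip(mutants, rel_idxs):
        match = MUTANT_RE.match(mutant)
        if match is None:
            raise ValueError(f"unrecognized mutant string: {mutant!r}")
        wt_aa, _position, mut_aa = match.groups()
        # The wild-type letter in the mutant string must match wt_seq
        if wt_seq[idx] != wt_aa:
            raise ValueError(
                f"{mutant}: expected {wt_aa} at index {idx}, found {wt_seq[idx]}"
            )
        seq[idx] = mut_aa
    return ''.join(seq)


toy_wt = 'MRHGDISSSN'  # first ten residues of the example wt_seq
print(apply_mutants(toy_wt, ['M1W'], [0]))            # single-site variant
print(apply_mutants(toy_wt, ['M1W', 'H3A'], [0, 2]))  # multi-site variant
```

The relative indices are used instead of the position embedded in the mutant string because position numbering may not start at 1 in every set.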
References
DeepSequence: Riesselman, Adam J., John B. Ingraham, and Debora S. Marks. "Deep generative models of genetic variation capture the effects of mutations." Nature Methods 15.10 (2018): 816-822.
Shin2021: Shin, Jung-Eun, et al. "Protein design and variant prediction using autoregressive generative models." Nature Communications 12.1 (2021): 1-11.
Created: 2024-01-11



