silkome-masp
Source: Hugging Face | Updated: 2026-05-30 | Indexed: 2026-05-31
Download link:
https://huggingface.co/datasets/lamm-mit/silkome-masp
Overview:
Silkome MaSp is a sequence-to-property dataset of major ampullate spidroins (MaSp), built for the SilkomeGPT research. It is a subset of the complete silkome-full dataset, keeping only rows whose category1 is MaSp, MaSp1, MaSp2, MaSp2B, MaSp3, or MaSp3B. The goal is to reproduce the MaSp-oriented sequence-to-fiber-property correspondence from the paper, which pairs protein-level spidroin sequences with fiber-level mechanical measurements.

The dataset contains 1,033 rows covering 1,028 unique amino acid sequences and 233 unique fiber/property identifiers (idv). Sequence lengths range from 115 to 1,854 amino acids, with a median of 378. Three splits are provided: full, train, and test. The benchmark split is a deterministic random split grouped by idv, so the training and test sets share no idv and no property tuple, which prevents data leakage.

The columns include sequence and metadata fields (idv, sequence, length, taxonomic information category1/category2, species information, source_split, benchmark_split, and others); eight core fiber-level mechanical properties (toughness, Young's modulus E, strength, and strain, each with its standard deviation, none of which has missing values); and normalized versions of these properties (e.g., toughnessNorm), which form the 8-dimensional property vector used for conditional generation in SilkomeGPT. The dataset also retains some additional material property columns.

The dataset is suited to spider silk protein language modeling, sequence-to-property prediction, conditional sequence generation, and protein/material design workflows.

Note that the dataset encodes a weakly supervised relationship: the mechanical targets are fiber-level measurements, while each sequence is a protein-level spidroin sequence. Spider silk fibers are complex multi-component materials, so multiple spidroin sequences may share the same measured fiber properties.
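The grouped benchmark split described above can be illustrated with a small sketch. This is not the dataset authors' procedure: the function name `group_split`, the seed, the 20% test fraction, and the toy rows are all assumptions made for illustration. It only shows the general idea of assigning whole idv groups to one side so that no idv appears in both train and test.

```python
import random


def group_split(rows, group_key="idv", test_fraction=0.2, seed=0):
    """Split rows so that every group lands entirely in train or in test.

    Hypothetical helper: the seed and test fraction are illustrative
    assumptions, not values taken from the dataset card.
    """
    # Sort the unique group ids first so the shuffle is reproducible.
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)

    # Reserve at least one group for the test side.
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])

    train = [row for row in rows if row[group_key] not in test_groups]
    test = [row for row in rows if row[group_key] in test_groups]
    return train, test


# Toy rows with made-up identifiers and sequences (not real dataset entries).
rows = [
    {"idv": "fiber_a", "sequence": "GGAGQGGYGG"},
    {"idv": "fiber_a", "sequence": "GPGQQGPGGY"},
    {"idv": "fiber_b", "sequence": "AAAAAAGGAG"},
    {"idv": "fiber_c", "sequence": "GPGGYGPGSQ"},
    {"idv": "fiber_d", "sequence": "GGLGSQGAGR"},
    {"idv": "fiber_e", "sequence": "QGAGAAAAAA"},
]

train_rows, test_rows = group_split(rows)
train_ids = {row["idv"] for row in train_rows}
test_ids = {row["idv"] for row in test_rows}
print("shared idv values:", train_ids & test_ids)
```

Splitting by group rather than by row matters here because several sequences can share one idv and therefore the same fiber-level targets; a row-level random split would let those shared targets leak from train into test.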
Provided by:
LAMM: MIT Laboratory for Atomistic and Molecular Mechanics
Created:
2026-05-30



