lamm-mit/structural-protein-families
Source: Hugging Face | Updated: 2026-05-29 | Indexed: 2026-05-31
Download link:
https://hf-mirror.com/datasets/lamm-mit/structural-protein-families
Description:
Amino-acid sequence windows from several structural / biomaterials protein families (spider spidroins, silkworm fibroin, collagen, elastin, resilin, keratin) plus globular controls (lysozyme, myoglobin, cytochrome c). Sequences are fetched from UniProt and sliced into overlapping windows of at most 200 amino acids (stride 150, up to 4 windows per protein); each row carries a `family` label. The dataset was built for a transfer-learning demo in which a lightweight classification head is trained on frozen ESMC embeddings to predict the protein family.
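The windowing step (windows of at most 200 residues, stride 150, at most 4 windows per protein) can be sketched as below. This is a minimal illustration, not the dataset's actual preprocessing script: the function name `slice_windows`, its parameter names, the choice to start at offset 0, and the choice to stop once a window reaches the end of the sequence are all assumptions.

```python
from typing import List


def slice_windows(
    seq: str,
    window: int = 200,      # maximum window length in residues (from the description)
    stride: int = 150,      # offset between consecutive window starts (from the description)
    max_windows: int = 4,   # cap on windows per protein (from the description)
) -> List[str]:
    """Cut one amino-acid sequence into overlapping windows.

    Assumption: windowing starts at position 0 and stops as soon as a
    window reaches the end of the sequence, so the last window may be
    shorter than `window`.
    """
    windows: List[str] = []
    for start in range(0, len(seq), stride):
        if len(windows) >= max_windows:
            break
        windows.append(seq[start:start + window])
        if start + window >= len(seq):
            # This window already covers the tail of the sequence.
            break
    return windows


if __name__ == "__main__":
    demo = "G" * 500  # hypothetical 500-residue sequence
    for i, w in enumerate(slice_windows(demo)):
        print(i, len(w))
```

With these defaults, consecutive windows overlap by 50 residues, and proteins longer than 650 residues are truncated to their first four windows.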
Provider:
lamm-mit



