PFAM Protein Families Dataset for Machine Learning

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link: https://zenodo.org/record/8132094
Description:
A cleaned dataset of protein sequences and protein families for classification. The dataset was exported from PFAM as of June 2023 and curated to have the following characteristics:

- only protein families with >= 100 sequences are included
- families with > 2000 sequences are truncated and represented by only 2000 sequences (chosen randomly)
- only proteins with sequence lengths between 100 and 1000 amino acids are included
- sequences are from PDB; chains are concatenated only if not similar

The dataset is not balanced. The numbers of sequences per family in PFAM and in the dataset are:

families: 62, sequences: 46872

total (in PFAM) -> included (in dataset)
Number in family ALLERGEN: 122 -> 122
Number in family APOPTOSIS: 381 -> 381
Number in family BIOSYNTHETIC PROTEIN: 346 -> 346
Number in family BIOTIN BINDING PROTEIN: 165 -> 165
Number in family BLOOD CLOTTING: 138 -> 138
Number in family CALCIUM BINDING PROTEIN: 135 -> 135
Number in family CELL ADHESION: 1116 -> 1116
Number in family CELL CYCLE: 511 -> 511
Number in family CHAPERONE: 964 -> 964
Number in family CONTRACTILE PROTEIN: 158 -> 158
Number in family CYTOKINE: 191 -> 191
Number in family DE NOVO PROTEIN: 253 -> 253
Number in family DNA BINDING PROTEIN: 1008 -> 1008
Number in family ELECTRON TRANSPORT: 841 -> 841
Number in family FLUORESCENT PROTEIN: 348 -> 348
Number in family GENE REGULATION: 607 -> 607
Number in family HORMONE: 272 -> 272
Number in family HORMONE GROWTH FACTOR: 159 -> 159
Number in family HORMONE RECEPTOR: 121 -> 121
Number in family HYDROLASE: 19551 -> 2000
Number in family HYDROLASE ANTIBIOTIC: 120 -> 120
Number in family HYDROLASE HYDROLASE INHIBITOR: 2890 -> 2000
Number in family HYDROLASE INHIBITOR: 315 -> 315
Number in family IMMUNE SYSTEM: 3333 -> 2000
Number in family IMMUNOGLOBULIN: 155 -> 155
Number in family ISOMERASE: 2457 -> 2000
Number in family ISOMERASE ISOMERASE INHIBITOR: 139 -> 139
Number in family LECTIN: 139 -> 139
Number in family LIGASE: 1780 -> 1780
Number in family LIGASE LIGASE INHIBITOR: 163 -> 163
Number in family LIPID BINDING PROTEIN: 421 -> 421
Number in family LIPID TRANSPORT: 115 -> 115
Number in family LUMINESCENT PROTEIN: 221 -> 221
Number in family LYASE: 4150 -> 2000
Number in family LYASE LYASE INHIBITOR: 298 -> 298
Number in family MEMBRANE PROTEIN: 1338 -> 1338
Number in family METAL BINDING PROTEIN: 951 -> 951
Number in family METAL TRANSPORT: 409 -> 409
Number in family MOTOR PROTEIN: 195 -> 195
Number in family OXIDOREDUCTASE: 11531 -> 2000
Number in family OXIDOREDUCTASE OXIDOREDUCTASE INHIBITOR: 766 -> 766
Number in family OXYGEN STORAGE: 127 -> 127
Number in family OXYGEN STORAGE TRANSPORT: 260 -> 260
Number in family OXYGEN TRANSPORT: 414 -> 414
Number in family PHOTOSYNTHESIS: 173 -> 173
Number in family PLANT PROTEIN: 255 -> 255
Number in family PROTEIN BINDING: 1613 -> 1613
Number in family PROTEIN TRANSPORT: 693 -> 693
Number in family RECEPTOR: 108 -> 108
Number in family REPLICATION: 161 -> 161
Number in family RNA BINDING PROTEIN: 546 -> 546
Number in family SIGNALING PROTEIN: 2312 -> 2000
Number in family STRUCTURAL PROTEIN: 869 -> 869
Number in family SUGAR BINDING PROTEIN: 1250 -> 1250
Number in family TOXIN: 546 -> 546
Number in family TRANSCRIPTION REGULATION: 3283 -> 2000
Number in family TRANSFERASE: 14724 -> 2000
Number in family TRANSFERASE INHIBITOR: 126 -> 126
Number in family TRANSFERASE TRANSFERASE INHIBITOR: 2465 -> 2000
Number in family TRANSLATION: 370 -> 370
Number in family TRANSPORT PROTEIN: 2782 -> 2000
Number in family VIRAL PROTEIN: 2150 -> 2000

Files:

families.csv: list of protein families with frequencies
pfam_46872x62.csv: full dataset with amino acid sequences as strings (one-letter code)
pfam-trn-xy.csv: training dataset with amino acid sequences as tokens (1..25), padded to a common length of 1000 with padding token 0:

Amino acid | Token | Description
--------------------------------
C | 1 | Cysteine
S | 2 | Serine
T | 3 | Threonine
A | 4 | Alanine
G | 5 | Glycine
P | 6 | Proline
D | 7 | Aspartic acid
E | 8 | Glutamic acid
Q | 9 | Glutamine
N | 10 | Asparagine
H | 11 | Histidine
R | 12 | Arginine
K | 13 | Lysine
M | 14 | Methionine
I | 15 | Isoleucine
L | 16 | Leucine
V | 17 | Valine
W | 18 | Tryptophan
Y | 19 | Tyrosine
F | 20 | Phenylalanine
B | 21 | Aspartic acid or Asparagine
Z | 22 | Glutamic acid or Glutamine
J | 23 | Leucine or Isoleucine
U | 24 | Selenocysteine
X | 25 | Unknown amino acid
. | 0 | padding token

pfam-trn-labels.csv: plain-text labels for training data
pfam-tst-xy.csv, pfam-tst-labels.csv: test data
pfam-balanced-trn-xy.csv, pfam-balanced-trn-labels.csv, pfam-balanced-tst-xy.csv, pfam-balanced-tst-labels.csv: balanced datasets, created by oversampling.
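The token encoding described for pfam-trn-xy.csv can be sketched in Python as follows. This is a minimal illustration, not the script used to build the dataset. The function names `encode` and `decode` are hypothetical. Padding on the right and mapping letters outside the table to X (25) are assumptions: the description gives only the padding token (0) and the common length (1000).

```python
# Letters in token order 1..25, as listed in the amino acid table.
ALPHABET = "CSTAGPDEQNHRKMILVWYFBZJUX"
TOKEN = {aa: i for i, aa in enumerate(ALPHABET, start=1)}
INVERSE = {i: aa for aa, i in TOKEN.items()}

PAD = 0         # padding token from the description
MAX_LEN = 1000  # common length from the description


def encode(sequence, max_len=MAX_LEN):
    """Map a one-letter amino acid string to token ids, padded with 0."""
    if len(sequence) > max_len:
        raise ValueError("sequence is longer than max_len")
    # Assumption: any letter not in the table is treated as X (unknown).
    ids = [TOKEN.get(aa, TOKEN["X"]) for aa in sequence.upper()]
    # Assumption: padding is appended on the right.
    return ids + [PAD] * (max_len - len(ids))


def decode(ids):
    """Drop padding tokens and map token ids back to one-letter codes."""
    return "".join(INVERSE[i] for i in ids if i != PAD)


if __name__ == "__main__":
    tokens = encode("MKVLA")
    print(tokens[:8])      # first five positions hold the residues
    print(len(tokens))     # always MAX_LEN
    print(decode(tokens))  # round-trips to the input string
```

If the published files pad on the left instead, only the return line of `encode` needs to change; `decode` works either way because it skips every 0.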
Created: 2023-07-20