PFAM Protein Families Dataset for Machine Learning
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/8132094
Description:
A cleaned dataset of protein sequences and protein families for classification. The dataset is exported from PFAM as of June 2023 and curated to achieve the following characteristics:
only protein families with >=100 sequences are included
families with >2000 sequences are truncated to 2000 randomly chosen sequences
only proteins with sequence lengths between 100 and 1000 are included
amino acid sequences are from PDB; chains are concatenated only if not similar
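The family-size and length criteria above can be sketched as a small filter. This is only an illustration, not the script used to build the dataset: the function name `curate`, the order of the steps (length filter first, then family-size filter), the inclusive length bounds, and the random seed are all assumptions. Chain concatenation is not covered.

```python
import random
from collections import defaultdict

def curate(records, min_family=100, max_family=2000,
           min_len=100, max_len=1000, seed=0):
    """Filter (family, sequence) pairs by sequence length and family size.

    Hypothetical helper; step order and inclusive bounds are assumptions.
    """
    rng = random.Random(seed)
    by_family = defaultdict(list)

    # Keep only sequences whose length lies within [min_len, max_len].
    for family, seq in records:
        if min_len <= len(seq) <= max_len:
            by_family[family].append(seq)

    curated = []
    for family, seqs in sorted(by_family.items()):
        if len(seqs) < min_family:
            continue  # drop families with too few sequences
        if len(seqs) > max_family:
            # Truncate large families to a random subset.
            seqs = rng.sample(seqs, max_family)
        curated.extend((family, s) for s in seqs)
    return curated
```

With the default thresholds, a family with 19551 qualifying sequences would be cut to 2000, and a family with 99 would be dropped.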
The dataset is not balanced. The numbers of sequences per family in PFAM and in the dataset are:
families: 62, sequences: 46872
total (in PFAM) -> included (in dataset)
Number in family ALLERGEN: 122 -> 122
Number in family APOPTOSIS: 381 -> 381
Number in family BIOSYNTHETIC PROTEIN: 346 -> 346
Number in family BIOTIN BINDING PROTEIN: 165 -> 165
Number in family BLOOD CLOTTING: 138 -> 138
Number in family CALCIUM BINDING PROTEIN: 135 -> 135
Number in family CELL ADHESION: 1116 -> 1116
Number in family CELL CYCLE: 511 -> 511
Number in family CHAPERONE: 964 -> 964
Number in family CONTRACTILE PROTEIN: 158 -> 158
Number in family CYTOKINE: 191 -> 191
Number in family DE NOVO PROTEIN: 253 -> 253
Number in family DNA BINDING PROTEIN: 1008 -> 1008
Number in family ELECTRON TRANSPORT: 841 -> 841
Number in family FLUORESCENT PROTEIN: 348 -> 348
Number in family GENE REGULATION: 607 -> 607
Number in family HORMONE: 272 -> 272
Number in family HORMONE GROWTH FACTOR: 159 -> 159
Number in family HORMONE RECEPTOR: 121 -> 121
Number in family HYDROLASE: 19551 -> 2000
Number in family HYDROLASE ANTIBIOTIC: 120 -> 120
Number in family HYDROLASE HYDROLASE INHIBITOR: 2890 -> 2000
Number in family HYDROLASE INHIBITOR: 315 -> 315
Number in family IMMUNE SYSTEM: 3333 -> 2000
Number in family IMMUNOGLOBULIN: 155 -> 155
Number in family ISOMERASE: 2457 -> 2000
Number in family ISOMERASE ISOMERASE INHIBITOR: 139 -> 139
Number in family LECTIN: 139 -> 139
Number in family LIGASE: 1780 -> 1780
Number in family LIGASE LIGASE INHIBITOR: 163 -> 163
Number in family LIPID BINDING PROTEIN: 421 -> 421
Number in family LIPID TRANSPORT: 115 -> 115
Number in family LUMINESCENT PROTEIN: 221 -> 221
Number in family LYASE: 4150 -> 2000
Number in family LYASE LYASE INHIBITOR: 298 -> 298
Number in family MEMBRANE PROTEIN: 1338 -> 1338
Number in family METAL BINDING PROTEIN: 951 -> 951
Number in family METAL TRANSPORT: 409 -> 409
Number in family MOTOR PROTEIN: 195 -> 195
Number in family OXIDOREDUCTASE: 11531 -> 2000
Number in family OXIDOREDUCTASE OXIDOREDUCTASE INHIBITOR: 766 -> 766
Number in family OXYGEN STORAGE: 127 -> 127
Number in family OXYGEN STORAGE TRANSPORT: 260 -> 260
Number in family OXYGEN TRANSPORT: 414 -> 414
Number in family PHOTOSYNTHESIS: 173 -> 173
Number in family PLANT PROTEIN: 255 -> 255
Number in family PROTEIN BINDING: 1613 -> 1613
Number in family PROTEIN TRANSPORT: 693 -> 693
Number in family RECEPTOR: 108 -> 108
Number in family REPLICATION: 161 -> 161
Number in family RNA BINDING PROTEIN: 546 -> 546
Number in family SIGNALING PROTEIN: 2312 -> 2000
Number in family STRUCTURAL PROTEIN: 869 -> 869
Number in family SUGAR BINDING PROTEIN: 1250 -> 1250
Number in family TOXIN: 546 -> 546
Number in family TRANSCRIPTION REGULATION: 3283 -> 2000
Number in family TRANSFERASE: 14724 -> 2000
Number in family TRANSFERASE INHIBITOR: 126 -> 126
Number in family TRANSFERASE TRANSFERASE INHIBITOR: 2465 -> 2000
Number in family TRANSLATION: 370 -> 370
Number in family TRANSPORT PROTEIN: 2782 -> 2000
Number in family VIRAL PROTEIN: 2150 -> 2000
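To quantify the imbalance, the listing above can be parsed directly. The following is a minimal sketch that assumes the lines keep the exact "Number in family NAME: total -> included" layout shown here; `parse_counts` is a hypothetical helper, and only four lines of the listing are embedded as sample input.

```python
import re

# Matches lines such as "Number in family HYDROLASE: 19551 -> 2000".
LINE = re.compile(
    r"^Number in family (?P<family>.+): (?P<total>\d+) -> (?P<included>\d+)$"
)

def parse_counts(text):
    """Return a list of (family, total_in_pfam, included_in_dataset)."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            rows.append((m["family"], int(m["total"]), int(m["included"])))
    return rows

sample = """\
Number in family ALLERGEN: 122 -> 122
Number in family HYDROLASE: 19551 -> 2000
Number in family HYDROLASE HYDROLASE INHIBITOR: 2890 -> 2000
Number in family RECEPTOR: 108 -> 108
"""

rows = parse_counts(sample)
# Families that were truncated to the 2000-sequence cap.
truncated = [family for family, total, included in rows if included < total]
largest = max(included for _, _, included in rows)
smallest = min(included for _, _, included in rows)
print(truncated)
print(largest / smallest)  # ratio between largest and smallest class
```

Over the full listing, the largest classes hold 2000 sequences and the smallest (RECEPTOR) holds 108, a ratio of roughly 18.5.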
Files:
families.csv: list of protein families with frequencies
pfam_46872x62.csv: full dataset with amino acid sequences as string (one-letter code)
pfam-trn-xy.csv: training dataset with amino acid sequences as tokens (1..25) and padded to a common length of 1000 with padding token 0:
Amino acid | Token | Description
--------------------------------
C | 1 | Cysteine
S | 2 | Serine
T | 3 | Threonine
A | 4 | Alanine
G | 5 | Glycine
P | 6 | Proline
D | 7 | Aspartic acid
E | 8 | Glutamic acid
Q | 9 | Glutamine
N | 10 | Asparagine
H | 11 | Histidine
R | 12 | Arginine
K | 13 | Lysine
M | 14 | Methionine
I | 15 | Isoleucine
L | 16 | Leucine
V | 17 | Valine
W | 18 | Tryptophan
Y | 19 | Tyrosine
F | 20 | Phenylalanine
B | 21 | Aspartic acid or Asparagine
Z | 22 | Glutamic acid or Glutamine
J | 23 | Leucine or Isoleucine
U | 24 | Selenocysteine
X | 25 | Unknown amino acid
. | 0 | padding token
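The token table above can be applied as follows. This is a sketch of the encoding only: the helper names `encode` and `decode` are hypothetical, and padding on the right (after the sequence) is an assumption, since the description states the padding token and the common length but not the padding side.

```python
# One-letter codes in token order: token = position in this string + 1.
ALPHABET = "CSTAGPDEQNHRKMILVWYFBZJUX"
TOKEN = {aa: i for i, aa in enumerate(ALPHABET, start=1)}
PAD = 0
MAX_LEN = 1000

def encode(seq, max_len=MAX_LEN):
    """Map a one-letter amino acid string to a fixed-length token list."""
    if len(seq) > max_len:
        raise ValueError(f"sequence longer than {max_len}")
    # Letters outside the table raise KeyError rather than being guessed.
    tokens = [TOKEN[aa] for aa in seq.upper()]
    # Assumption: pad on the right with the padding token 0.
    return tokens + [PAD] * (max_len - len(tokens))

def decode(tokens):
    """Invert encode(), dropping padding tokens."""
    return "".join(ALPHABET[t - 1] for t in tokens if t != PAD)
```

For example, `encode("MKV")` starts with 14, 13, 17 and is followed by 997 padding zeros.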
pfam-trn-labels.csv: plain-text labels for training data
pfam-tst-xy.csv, pfam-tst-labels.csv: test data
pfam-balanced-trn-xy.csv, pfam-balanced-trn-labels.csv, pfam-balanced-tst-xy.csv, pfam-balanced-tst-labels.csv: balanced datasets, created by oversampling
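The description does not say which oversampling scheme produced the balanced files. As one plausible illustration, plain random oversampling with replacement duplicates samples from smaller classes until every class matches the largest one. The function `oversample` and its seed are hypothetical and not taken from the dataset's own tooling.

```python
import random
from collections import defaultdict

def oversample(samples, labels, seed=0):
    """Random oversampling: grow every class to the size of the largest.

    Illustrative only; the dataset's actual procedure is not documented.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(samples, labels):
        by_label[y].append(x)

    target = max(len(xs) for xs in by_label.values())
    out_x, out_y = [], []
    for y, xs in sorted(by_label.items()):
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y
```

Because duplicated sequences carry no new information, oversampling of this kind is usually applied after the train/test split so that copies of one sequence do not end up on both sides.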
Created:
2023-07-20



