graph-transformers/pinder
Source: Hugging Face; updated 2024-11-02; indexed 2025-11-01
Download link:
https://hf-mirror.com/datasets/graph-transformers/pinder
Resource overview:
---
dataset_info:
features:
- name: id
dtype: string
- name: cluster_id
dtype: string
- name: pdb_id
dtype: string
- name: complex
dtype:
protein_atom_array_feature:
residue_dictionary:
residue_names:
- ALA
- ARG
- ASN
- ASP
- CYS
- GLN
- GLU
- GLY
- HIS
- ILE
- LEU
- LYS
- MET
- PHE
- PRO
- SER
- THR
- TRP
- TYR
- VAL
- UNK
residue_types:
- A
- R
- N
- D
- C
- Q
- E
- G
- H
- I
- L
- K
- M
- F
- P
- S
- T
- W
- Y
- V
- X
atom_types:
- N
- CA
- C
- CB
- O
- CG
- CG1
- CG2
- OG
- OG1
- SG
- CD
- CD1
- CD2
- ND1
- ND2
- OD1
- OD2
- SD
- CE
- CE1
- CE2
- CE3
- NE
- NE1
- NE2
- OE1
- OE2
- CH2
- NH1
- NH2
- OH
- CZ
- CZ2
- CZ3
- NZ
- OXT
residue_atoms:
ALA:
- N
- CA
- C
- O
- CB
ARG:
- N
- CA
- C
- O
- CB
- CG
- CD
- NE
- CZ
- NH1
- NH2
ASP:
- N
- CA
- C
- O
- CB
- CG
- OD1
- OD2
ASN:
- N
- CA
- C
- O
- CB
- CG
- OD1
- ND2
CYS:
- N
- CA
- C
- O
- CB
- SG
SEC:
- N
- CA
- C
- O
- CB
- SE
GLU:
- N
- CA
- C
- O
- CB
- CG
- CD
- OE1
- OE2
GLN:
- N
- CA
- C
- O
- CB
- CG
- CD
- OE1
- NE2
GLY:
- N
- CA
- C
- O
HIS:
- N
- CA
- C
- O
- CB
- CG
- ND1
- CE1
- NE2
- CD2
ILE:
- N
- CA
- C
- O
- CB
- CG1
- CG2
- CD1
LEU:
- N
- CA
- C
- O
- CB
- CG
- CD1
- CD2
LYS:
- N
- CA
- C
- O
- CB
- CG
- CD
- CE
- NZ
MET:
- N
- CA
- C
- O
- CB
- CG
- SD
- CE
MSE:
- N
- CA
- C
- O
- CB
- CG
- SE
- CE
PHE:
- N
- CA
- C
- O
- CB
- CG
- CD1
- CD2
- CE1
- CE2
- CZ
PRO:
- N
- CA
- C
- O
- CB
- CG
- CD
SER:
- N
- CA
- C
- O
- CB
- OG
THR:
- N
- CA
- C
- O
- CB
- OG1
- CG2
TRP:
- N
- CA
- C
- O
- CB
- CG
- CD1
- NE1
- CE2
- CD2
- CE3
- CZ2
- CZ3
- CH2
TYR:
- N
- CA
- C
- O
- CB
- CG
- CD1
- CD2
- CE1
- CE2
- CZ
- OH
VAL:
- N
- CA
- C
- O
- CB
- CG1
- CG2
UNK:
- N
- CA
- C
- O
backbone_atoms:
- N
- CA
- C
- O
unknown_residue_name: UNK
conversions:
- residue: MSE
to_residue: MET
atom_swaps:
- - SE
- SD
- residue: SEC
to_residue: CYS
atom_swaps:
- - SE
- SG
with_res_id: true
- name: apo_receptor
dtype:
protein_atom_array_feature:
residue_dictionary:
residue_names:
- ALA
- ARG
- ASN
- ASP
- CYS
- GLN
- GLU
- GLY
- HIS
- ILE
- LEU
- LYS
- MET
- PHE
- PRO
- SER
- THR
- TRP
- TYR
- VAL
- UNK
residue_types:
- A
- R
- N
- D
- C
- Q
- E
- G
- H
- I
- L
- K
- M
- F
- P
- S
- T
- W
- Y
- V
- X
atom_types:
- N
- CA
- C
- CB
- O
- CG
- CG1
- CG2
- OG
- OG1
- SG
- CD
- CD1
- CD2
- ND1
- ND2
- OD1
- OD2
- SD
- CE
- CE1
- CE2
- CE3
- NE
- NE1
- NE2
- OE1
- OE2
- CH2
- NH1
- NH2
- OH
- CZ
- CZ2
- CZ3
- NZ
- OXT
residue_atoms:
ALA:
- N
- CA
- C
- O
- CB
ARG:
- N
- CA
- C
- O
- CB
- CG
- CD
- NE
- CZ
- NH1
- NH2
ASP:
- N
- CA
- C
- O
- CB
- CG
- OD1
- OD2
ASN:
- N
- CA
- C
- O
- CB
- CG
- OD1
- ND2
CYS:
- N
- CA
- C
- O
- CB
- SG
SEC:
- N
- CA
- C
- O
- CB
- SE
GLU:
- N
- CA
- C
- O
- CB
- CG
- CD
- OE1
- OE2
GLN:
- N
- CA
- C
- O
- CB
- CG
- CD
- OE1
- NE2
GLY:
- N
- CA
- C
- O
HIS:
- N
- CA
- C
- O
- CB
- CG
- ND1
- CE1
- NE2
- CD2
ILE:
- N
- CA
- C
- O
- CB
- CG1
- CG2
- CD1
LEU:
- N
- CA
- C
- O
- CB
- CG
- CD1
- CD2
LYS:
- N
- CA
- C
- O
- CB
- CG
- CD
- CE
- NZ
MET:
- N
- CA
- C
- O
- CB
- CG
- SD
- CE
MSE:
- N
- CA
- C
- O
- CB
- CG
- SE
- CE
PHE:
- N
- CA
- C
- O
- CB
- CG
- CD1
- CD2
- CE1
- CE2
- CZ
PRO:
- N
- CA
- C
- O
- CB
- CG
- CD
SER:
- N
- CA
- C
- O
- CB
- OG
THR:
- N
- CA
- C
- O
- CB
- OG1
- CG2
TRP:
- N
- CA
- C
- O
- CB
- CG
- CD1
- NE1
- CE2
- CD2
- CE3
- CZ2
- CZ3
- CH2
TYR:
- N
- CA
- C
- O
- CB
- CG
- CD1
- CD2
- CE1
- CE2
- CZ
- OH
VAL:
- N
- CA
- C
- O
- CB
- CG1
- CG2
UNK:
- N
- CA
- C
- O
backbone_atoms:
- N
- CA
- C
- O
unknown_residue_name: UNK
conversions:
- residue: MSE
to_residue: MET
atom_swaps:
- - SE
- SD
- residue: SEC
to_residue: CYS
atom_swaps:
- - SE
- SG
with_res_id: true
- name: apo_ligand
dtype:
protein_atom_array_feature:
residue_dictionary:
residue_names:
- ALA
- ARG
- ASN
- ASP
- CYS
- GLN
- GLU
- GLY
- HIS
- ILE
- LEU
- LYS
- MET
- PHE
- PRO
- SER
- THR
- TRP
- TYR
- VAL
- UNK
residue_types:
- A
- R
- N
- D
- C
- Q
- E
- G
- H
- I
- L
- K
- M
- F
- P
- S
- T
- W
- Y
- V
- X
atom_types:
- N
- CA
- C
- CB
- O
- CG
- CG1
- CG2
- OG
- OG1
- SG
- CD
- CD1
- CD2
- ND1
- ND2
- OD1
- OD2
- SD
- CE
- CE1
- CE2
- CE3
- NE
- NE1
- NE2
- OE1
- OE2
- CH2
- NH1
- NH2
- OH
- CZ
- CZ2
- CZ3
- NZ
- OXT
residue_atoms:
ALA:
- N
- CA
- C
- O
- CB
ARG:
- N
- CA
- C
- O
- CB
- CG
- CD
- NE
- CZ
- NH1
- NH2
ASP:
- N
- CA
- C
- O
- CB
- CG
- OD1
- OD2
ASN:
- N
- CA
- C
- O
- CB
- CG
- OD1
- ND2
CYS:
- N
- CA
- C
- O
- CB
- SG
SEC:
- N
- CA
- C
- O
- CB
- SE
GLU:
- N
- CA
- C
- O
- CB
- CG
- CD
- OE1
- OE2
GLN:
- N
- CA
- C
- O
- CB
- CG
- CD
- OE1
- NE2
GLY:
- N
- CA
- C
- O
HIS:
- N
- CA
- C
- O
- CB
- CG
- ND1
- CE1
- NE2
- CD2
ILE:
- N
- CA
- C
- O
- CB
- CG1
- CG2
- CD1
LEU:
- N
- CA
- C
- O
- CB
- CG
- CD1
- CD2
LYS:
- N
- CA
- C
- O
- CB
- CG
- CD
- CE
- NZ
MET:
- N
- CA
- C
- O
- CB
- CG
- SD
- CE
MSE:
- N
- CA
- C
- O
- CB
- CG
- SE
- CE
PHE:
- N
- CA
- C
- O
- CB
- CG
- CD1
- CD2
- CE1
- CE2
- CZ
PRO:
- N
- CA
- C
- O
- CB
- CG
- CD
SER:
- N
- CA
- C
- O
- CB
- OG
THR:
- N
- CA
- C
- O
- CB
- OG1
- CG2
TRP:
- N
- CA
- C
- O
- CB
- CG
- CD1
- NE1
- CE2
- CD2
- CE3
- CZ2
- CZ3
- CH2
TYR:
- N
- CA
- C
- O
- CB
- CG
- CD1
- CD2
- CE1
- CE2
- CZ
- OH
VAL:
- N
- CA
- C
- O
- CB
- CG1
- CG2
UNK:
- N
- CA
- C
- O
backbone_atoms:
- N
- CA
- C
- O
unknown_residue_name: UNK
conversions:
- residue: MSE
to_residue: MET
atom_swaps:
- - SE
- SD
- residue: SEC
to_residue: CYS
atom_swaps:
- - SE
- SG
with_res_id: true
- name: pred_receptor
dtype:
protein_atom_array_feature:
residue_dictionary:
residue_names:
- ALA
- ARG
- ASN
- ASP
- CYS
- GLN
- GLU
- GLY
- HIS
- ILE
- LEU
- LYS
- MET
- PHE
- PRO
- SER
- THR
- TRP
- TYR
- VAL
- UNK
residue_types:
- A
- R
- N
- D
- C
- Q
- E
- G
- H
- I
- L
- K
- M
- F
- P
- S
- T
- W
- Y
- V
- X
atom_types:
- N
- CA
- C
- CB
- O
- CG
- CG1
- CG2
- OG
- OG1
- SG
- CD
- CD1
- CD2
- ND1
- ND2
- OD1
- OD2
- SD
- CE
- CE1
- CE2
- CE3
- NE
- NE1
- NE2
- OE1
- OE2
- CH2
- NH1
- NH2
- OH
- CZ
- CZ2
- CZ3
- NZ
- OXT
residue_atoms:
ALA:
- N
- CA
- C
- O
- CB
ARG:
- N
- CA
- C
- O
- CB
- CG
- CD
- NE
- CZ
- NH1
- NH2
ASP:
- N
- CA
- C
- O
- CB
- CG
- OD1
- OD2
ASN:
- N
- CA
- C
- O
- CB
- CG
- OD1
- ND2
CYS:
- N
- CA
- C
- O
- CB
- SG
SEC:
- N
- CA
- C
- O
- CB
- SE
GLU:
- N
- CA
- C
- O
- CB
- CG
- CD
- OE1
- OE2
GLN:
- N
- CA
- C
- O
- CB
- CG
- CD
- OE1
- NE2
GLY:
- N
- CA
- C
- O
HIS:
- N
- CA
- C
- O
- CB
- CG
- ND1
- CE1
- NE2
- CD2
ILE:
- N
- CA
- C
- O
- CB
- CG1
- CG2
- CD1
LEU:
- N
- CA
- C
- O
- CB
- CG
- CD1
- CD2
LYS:
- N
- CA
- C
- O
- CB
- CG
- CD
- CE
- NZ
MET:
- N
- CA
- C
- O
- CB
- CG
- SD
- CE
MSE:
- N
- CA
- C
- O
- CB
- CG
- SE
- CE
PHE:
- N
- CA
- C
- O
- CB
- CG
- CD1
- CD2
- CE1
- CE2
- CZ
PRO:
- N
- CA
- C
- O
- CB
- CG
- CD
SER:
- N
- CA
- C
- O
- CB
- OG
THR:
- N
- CA
- C
- O
- CB
- OG1
- CG2
TRP:
- N
- CA
- C
- O
- CB
- CG
- CD1
- NE1
- CE2
- CD2
- CE3
- CZ2
- CZ3
- CH2
TYR:
- N
- CA
- C
- O
- CB
- CG
- CD1
- CD2
- CE1
- CE2
- CZ
- OH
VAL:
- N
- CA
- C
- O
- CB
- CG1
- CG2
UNK:
- N
- CA
- C
- O
backbone_atoms:
- N
- CA
- C
- O
unknown_residue_name: UNK
conversions:
- residue: MSE
to_residue: MET
atom_swaps:
- - SE
- SD
- residue: SEC
to_residue: CYS
atom_swaps:
- - SE
- SG
with_res_id: true
- name: pred_ligand
dtype:
protein_atom_array_feature:
residue_dictionary:
residue_names:
- ALA
- ARG
- ASN
- ASP
- CYS
- GLN
- GLU
- GLY
- HIS
- ILE
- LEU
- LYS
- MET
- PHE
- PRO
- SER
- THR
- TRP
- TYR
- VAL
- UNK
residue_types:
- A
- R
- N
- D
- C
- Q
- E
- G
- H
- I
- L
- K
- M
- F
- P
- S
- T
- W
- Y
- V
- X
atom_types:
- N
- CA
- C
- CB
- O
- CG
- CG1
- CG2
- OG
- OG1
- SG
- CD
- CD1
- CD2
- ND1
- ND2
- OD1
- OD2
- SD
- CE
- CE1
- CE2
- CE3
- NE
- NE1
- NE2
- OE1
- OE2
- CH2
- NH1
- NH2
- OH
- CZ
- CZ2
- CZ3
- NZ
- OXT
residue_atoms:
ALA:
- N
- CA
- C
- O
- CB
ARG:
- N
- CA
- C
- O
- CB
- CG
- CD
- NE
- CZ
- NH1
- NH2
ASP:
- N
- CA
- C
- O
- CB
- CG
- OD1
- OD2
ASN:
- N
- CA
- C
- O
- CB
- CG
- OD1
- ND2
CYS:
- N
- CA
- C
- O
- CB
- SG
SEC:
- N
- CA
- C
- O
- CB
- SE
GLU:
- N
- CA
- C
- O
- CB
- CG
- CD
- OE1
- OE2
GLN:
- N
- CA
- C
- O
- CB
- CG
- CD
- OE1
- NE2
GLY:
- N
- CA
- C
- O
HIS:
- N
- CA
- C
- O
- CB
- CG
- ND1
- CE1
- NE2
- CD2
ILE:
- N
- CA
- C
- O
- CB
- CG1
- CG2
- CD1
LEU:
- N
- CA
- C
- O
- CB
- CG
- CD1
- CD2
LYS:
- N
- CA
- C
- O
- CB
- CG
- CD
- CE
- NZ
MET:
- N
- CA
- C
- O
- CB
- CG
- SD
- CE
MSE:
- N
- CA
- C
- O
- CB
- CG
- SE
- CE
PHE:
- N
- CA
- C
- O
- CB
- CG
- CD1
- CD2
- CE1
- CE2
- CZ
PRO:
- N
- CA
- C
- O
- CB
- CG
- CD
SER:
- N
- CA
- C
- O
- CB
- OG
THR:
- N
- CA
- C
- O
- CB
- OG1
- CG2
TRP:
- N
- CA
- C
- O
- CB
- CG
- CD1
- NE1
- CE2
- CD2
- CE3
- CZ2
- CZ3
- CH2
TYR:
- N
- CA
- C
- O
- CB
- CG
- CD1
- CD2
- CE1
- CE2
- CZ
- OH
VAL:
- N
- CA
- C
- O
- CB
- CG1
- CG2
UNK:
- N
- CA
- C
- O
backbone_atoms:
- N
- CA
- C
- O
unknown_residue_name: UNK
conversions:
- residue: MSE
to_residue: MET
atom_swaps:
- - SE
- SD
- residue: SEC
to_residue: CYS
atom_swaps:
- - SE
- SG
with_res_id: true
- name: receptor_uniprot_accession
dtype: string
- name: ligand_uniprot_accession
dtype: string
- name: receptor_uniprot_seq
dtype: string
- name: ligand_uniprot_seq
dtype: string
- name: receptor_resids_with_uniprot_mapping
sequence: uint16
- name: receptor_mapped_uniprot_resids
sequence: uint16
- name: ligand_resids_with_uniprot_mapping
sequence: uint16
- name: ligand_mapped_uniprot_resids
sequence: uint16
- name: oligomeric_count
dtype: uint16
- name: resolution
dtype: float16
- name: probability
dtype: float16
- name: method
dtype: string
splits:
- name: pinder_xl
num_bytes: 416782224.0
num_examples: 1955
- name: val
num_bytes: 395833402.0
num_examples: 1958
- name: pinder_s
num_bytes: 58797916.0
num_examples: 250
- name: pinder_af2
num_bytes: 31709642.0
num_examples: 180
download_size: 402402974
dataset_size: 903123184.0
configs:
- config_name: default
data_files:
- split: pinder_s
path: data/pinder_s-*
- split: pinder_af2
path: data/pinder_af2-*
- split: pinder_xl
path: data/pinder_xl-*
- split: val
path: data/val-*
---
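The `conversions` entries in the metadata above map selenium-containing residues onto their standard counterparts. The following is a minimal sketch of how such rules could be applied to a single (residue name, atom name) pair. The `Conversion` type and the `convert_atom` helper are hypothetical names introduced for illustration, and reading each `atom_swaps` pair as (old atom, new atom) is an assumption based on the target residues.

```python
from typing import NamedTuple, Tuple


class Conversion(NamedTuple):
    """One entry of the `conversions` list in the dataset metadata."""
    residue: str                             # source residue name, e.g. "MSE"
    to_residue: str                          # target residue name, e.g. "MET"
    atom_swaps: Tuple[Tuple[str, str], ...]  # assumed (old_atom, new_atom) pairs


# The two rules listed in the metadata.
CONVERSIONS = (
    Conversion("MSE", "MET", (("SE", "SD"),)),
    Conversion("SEC", "CYS", (("SE", "SG"),)),
)


def convert_atom(res_name, atom_name, conversions=CONVERSIONS):
    """Return the (residue, atom) pair after applying a matching rule.

    Pairs whose residue has no rule are returned unchanged.
    """
    for conv in conversions:
        if res_name == conv.residue:
            swaps = dict(conv.atom_swaps)
            # Atoms without a swap entry keep their original name.
            return conv.to_residue, swaps.get(atom_name, atom_name)
    return res_name, atom_name


if __name__ == "__main__":
    for pair in [("MSE", "SE"), ("SEC", "SE"), ("MSE", "CA"), ("ALA", "CB")]:
        print(pair, "->", convert_atom(*pair))
```

Keeping the rules as data rather than as hard-coded branches mirrors how the metadata expresses them, so further rules could be added without touching the function.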
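The dataset details are as follows:
## Feature fields
1. **id**: string; unique identifier of the sample
2. **cluster_id**: string; cluster identifier
3. **pdb_id**: string; Protein Data Bank (PDB) identifier
4. **complex**: protein atom array feature (`protein_atom_array_feature`) whose residue dictionary (`residue_dictionary`) is configured as follows:
   - Residue names: alanine (ALA), arginine (ARG), asparagine (ASN), aspartic acid (ASP), cysteine (CYS), glutamine (GLN), glutamic acid (GLU), glycine (GLY), histidine (HIS), isoleucine (ILE), leucine (LEU), lysine (LYS), methionine (MET), phenylalanine (PHE), proline (PRO), serine (SER), threonine (THR), tryptophan (TRP), tyrosine (TYR), valine (VAL), and the unknown residue (UNK)
   - One-letter residue types, in the same order: A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, X
   - Supported atom types: N, CA, C, CB, O, CG, CG1, CG2, OG, OG1, SG, CD, CD1, CD2, ND1, ND2, OD1, OD2, SD, CE, CE1, CE2, CE3, NE, NE1, NE2, OE1, OE2, CH2, NH1, NH2, OH, CZ, CZ2, CZ3, NZ, OXT
   - Residue-to-atom mapping: defines the atom set of each standard residue. For example, alanine (ALA) contains N, CA, C, O, CB, and arginine (ARG) contains N, CA, C, O, CB, CG, CD, NE, CZ, NH1, NH2. Atom sets are also defined for the special residues selenocysteine (SEC) and selenomethionine (MSE)
   - Backbone atoms: the four standard protein backbone atoms N, CA, C, O
   - Unknown residue name: UNK
   - Residue conversion rules: two rules are defined: (1) selenomethionine (MSE) is converted to methionine (MET), with the SE atom renamed to SD; (2) selenocysteine (SEC) is converted to cysteine (CYS), with the SE atom renamed to SG
   - Residue IDs: `with_res_id` is true
5. **apo_receptor**: protein atom array feature with the same structure as `complex`; the unbound (apo) receptor protein
6. **apo_ligand**: protein atom array feature with the same structure as `complex`; the unbound (apo) form of the ligand, as the dataset defines it
7. **pred_receptor**: protein atom array feature with the same structure as `complex`; the predicted receptor structure
8. **pred_ligand**: protein atom array feature with the same structure as `complex`; the predicted ligand structure
9. **receptor_uniprot_accession**: string; UniProt (Universal Protein Resource) accession of the receptor protein
10. **ligand_uniprot_accession**: string; UniProt accession of the ligand
11. **receptor_uniprot_seq**: string; UniProt reference sequence of the receptor protein
12. **ligand_uniprot_seq**: string; UniProt reference sequence of the ligand
13. **receptor_resids_with_uniprot_mapping**: sequence of unsigned 16-bit integers; the receptor residue IDs that have a UniProt sequence mapping
14. **receptor_mapped_uniprot_resids**: sequence of unsigned 16-bit integers; the UniProt reference sequence residue IDs to which those receptor residues map
15. **ligand_resids_with_uniprot_mapping**: sequence of unsigned 16-bit integers; the ligand residue IDs that have a UniProt sequence mapping
16. **ligand_mapped_uniprot_resids**: sequence of unsigned 16-bit integers; the UniProt reference sequence residue IDs to which those ligand residues map
17. **oligomeric_count**: unsigned 16-bit integer; oligomeric count of the target complex
18. **resolution**: half-precision float; resolution at which the complex structure was determined
19. **probability**: half-precision float; confidence probability of the model prediction
20. **method**: string; experimental method used to determine the complex structure (for example X-ray diffraction or cryo-electron microscopy)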
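The paired fields `*_resids_with_uniprot_mapping` and `*_mapped_uniprot_resids` can be combined into a lookup from structure residue ID to UniProt position. The sketch below assumes the two sequences are parallel (element `i` of one corresponds to element `i` of the other) and that UniProt residue IDs are 1-based indices into the `*_uniprot_seq` string. Neither assumption is confirmed by the metadata, and the helper names and toy values are hypothetical.

```python
def build_uniprot_map(resids_with_mapping, mapped_uniprot_resids):
    """Map structure residue IDs to UniProt residue IDs.

    Assumes both sequences are aligned element by element.
    """
    if len(resids_with_mapping) != len(mapped_uniprot_resids):
        raise ValueError("mapping sequences must have the same length")
    return dict(zip(resids_with_mapping, mapped_uniprot_resids))


def uniprot_letter(uniprot_seq, uniprot_resid):
    """Return the one-letter code at a UniProt position (assumed 1-based)."""
    if not 1 <= uniprot_resid <= len(uniprot_seq):
        raise IndexError(f"UniProt residue ID {uniprot_resid} is out of range")
    return uniprot_seq[uniprot_resid - 1]


if __name__ == "__main__":
    # Toy values for illustration only; they do not come from the dataset.
    seq = "MKTAYIAK"
    mapping = build_uniprot_map([10, 11, 12], [2, 3, 4])
    for resid, uniprot_resid in mapping.items():
        print(resid, uniprot_resid, uniprot_letter(seq, uniprot_resid))
```

The same helpers apply to the receptor and the ligand fields, since both follow the same naming pattern.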
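## Dataset splits
The dataset contains four splits:
- `pinder_xl` split: 416782224 bytes in total, 1955 examples
- `val` (validation) split: 395833402 bytes in total, 1958 examples
- `pinder_s` split: 58797916 bytes in total, 250 examples
- `pinder_af2` split: 31709642 bytes in total, 180 examples

The total download size of the dataset is 402402974 bytes, and the total dataset size is 903123184 bytes.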
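As a quick consistency check, the per-split byte counts listed above should add up to the reported total dataset size. The sketch below sums them and also derives the average size per example. The variable names are illustrative, and the numbers are copied from the split listing.

```python
# (num_bytes, num_examples) per split, copied from the listing above.
SPLITS = {
    "pinder_xl": (416782224, 1955),
    "val": (395833402, 1958),
    "pinder_s": (58797916, 250),
    "pinder_af2": (31709642, 180),
}
REPORTED_DATASET_SIZE = 903123184  # bytes
REPORTED_DOWNLOAD_SIZE = 402402974  # bytes

total_bytes = sum(num_bytes for num_bytes, _ in SPLITS.values())
total_examples = sum(num_examples for _, num_examples in SPLITS.values())

# The split sizes should account for the whole dataset size.
sizes_consistent = total_bytes == REPORTED_DATASET_SIZE

if __name__ == "__main__":
    print("sizes consistent:", sizes_consistent)
    print("total examples:", total_examples)
    for name, (num_bytes, num_examples) in SPLITS.items():
        print(f"{name}: {num_bytes / num_examples:.0f} bytes per example")
    ratio = REPORTED_DOWNLOAD_SIZE / REPORTED_DATASET_SIZE
    print(f"download size / dataset size: {ratio:.2f}")
```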
## Dataset configuration
The default configuration (`default`) uses the following data file paths:
- `pinder_s` split: `data/pinder_s-*`
- `pinder_af2` split: `data/pinder_af2-*`
- `pinder_xl` split: `data/pinder_xl-*`
- `val` split: `data/val-*`
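The glob patterns above determine which files belong to which split. The sketch below resolves a file path to its split with `fnmatch` and wraps a call to `datasets.load_dataset` for loading one split from the Hub. The file names in the demo are hypothetical examples of a shard naming scheme, the helper names are illustrative, and decoding the custom `protein_atom_array_feature` columns may require additional software that registers that feature type; that requirement is an assumption, not something the listing states.

```python
from fnmatch import fnmatch

REPO_ID = "graph-transformers/pinder"

# Split name -> glob pattern, copied from the default configuration.
DATA_FILES = {
    "pinder_s": "data/pinder_s-*",
    "pinder_af2": "data/pinder_af2-*",
    "pinder_xl": "data/pinder_xl-*",
    "val": "data/val-*",
}


def split_for_path(path):
    """Return the split whose pattern matches `path`, or None."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


def load_split(split):
    """Load one split from the Hugging Face Hub (needs network access)."""
    if split not in DATA_FILES:
        raise KeyError(f"unknown split: {split!r}")
    # Imported lazily so the path helpers work without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(REPO_ID, split=split)


if __name__ == "__main__":
    # Hypothetical shard names, for illustration only.
    for path in ["data/pinder_s-00000-of-00001.parquet", "README.md"]:
        print(path, "->", split_for_path(path))
```

Starting with a small split such as `pinder_s` keeps a first download short before moving on to the larger `pinder_xl` and `val` splits.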
Provided by:
graph-transformers



