
graph-transformers/pinder

Hugging Face, updated 2024-11-02, indexed 2025-11-01
Download link:
https://hf-mirror.com/datasets/graph-transformers/pinder
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: cluster_id
    dtype: string
  - name: pdb_id
    dtype: string
  - name: complex
    dtype:
      protein_atom_array_feature:
        residue_dictionary: &residue_dictionary
          residue_names: [ALA, ARG, ASN, ASP, CYS, GLN, GLU, GLY, HIS, ILE, LEU, LYS, MET, PHE, PRO, SER, THR, TRP, TYR, VAL, UNK]
          residue_types: [A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, X]
          atom_types: [N, CA, C, CB, O, CG, CG1, CG2, OG, OG1, SG, CD, CD1, CD2, ND1, ND2, OD1, OD2, SD, CE, CE1, CE2, CE3, NE, NE1, NE2, OE1, OE2, CH2, NH1, NH2, OH, CZ, CZ2, CZ3, NZ, OXT]
          residue_atoms:
            ALA: [N, CA, C, O, CB]
            ARG: [N, CA, C, O, CB, CG, CD, NE, CZ, NH1, NH2]
            ASP: [N, CA, C, O, CB, CG, OD1, OD2]
            ASN: [N, CA, C, O, CB, CG, OD1, ND2]
            CYS: [N, CA, C, O, CB, SG]
            SEC: [N, CA, C, O, CB, SE]
            GLU: [N, CA, C, O, CB, CG, CD, OE1, OE2]
            GLN: [N, CA, C, O, CB, CG, CD, OE1, NE2]
            GLY: [N, CA, C, O]
            HIS: [N, CA, C, O, CB, CG, ND1, CE1, NE2, CD2]
            ILE: [N, CA, C, O, CB, CG1, CG2, CD1]
            LEU: [N, CA, C, O, CB, CG, CD1, CD2]
            LYS: [N, CA, C, O, CB, CG, CD, CE, NZ]
            MET: [N, CA, C, O, CB, CG, SD, CE]
            MSE: [N, CA, C, O, CB, CG, SE, CE]
            PHE: [N, CA, C, O, CB, CG, CD1, CD2, CE1, CE2, CZ]
            PRO: [N, CA, C, O, CB, CG, CD]
            SER: [N, CA, C, O, CB, OG]
            THR: [N, CA, C, O, CB, OG1, CG2]
            TRP: [N, CA, C, O, CB, CG, CD1, NE1, CE2, CD2, CE3, CZ2, CZ3, CH2]
            TYR: [N, CA, C, O, CB, CG, CD1, CD2, CE1, CE2, CZ, OH]
            VAL: [N, CA, C, O, CB, CG1, CG2]
            UNK: [N, CA, C, O]
          backbone_atoms: [N, CA, C, O]
          unknown_residue_name: UNK
          conversions:
          - residue: MSE
            to_residue: MET
            atom_swaps:
            - [SE, SD]
          - residue: SEC
            to_residue: CYS
            atom_swaps:
            - [SE, SG]
        with_res_id: true
  - name: apo_receptor
    dtype:
      protein_atom_array_feature:
        residue_dictionary: *residue_dictionary
        with_res_id: true
  - name: apo_ligand
    dtype:
      protein_atom_array_feature:
        residue_dictionary: *residue_dictionary
        with_res_id: true
  - name: pred_receptor
    dtype:
      protein_atom_array_feature:
        residue_dictionary: *residue_dictionary
        with_res_id: true
  - name: pred_ligand
    dtype:
      protein_atom_array_feature:
        residue_dictionary: *residue_dictionary
        with_res_id: true
  - name: receptor_uniprot_accession
    dtype: string
  - name: ligand_uniprot_accession
    dtype: string
  - name: receptor_uniprot_seq
    dtype: string
  - name: ligand_uniprot_seq
    dtype: string
  - name: receptor_resids_with_uniprot_mapping
    sequence: uint16
  - name: receptor_mapped_uniprot_resids
    sequence: uint16
  - name: ligand_resids_with_uniprot_mapping
    sequence: uint16
  - name: ligand_mapped_uniprot_resids
    sequence: uint16
  - name: oligomeric_count
    dtype: uint16
  - name: resolution
    dtype: float16
  - name: probability
    dtype: float16
  - name: method
    dtype: string
  splits:
  - name: pinder_xl
    num_bytes: 416782224.0
    num_examples: 1955
  - name: val
    num_bytes: 395833402.0
    num_examples: 1958
  - name: pinder_s
    num_bytes: 58797916.0
    num_examples: 250
  - name: pinder_af2
    num_bytes: 31709642.0
    num_examples: 180
  download_size: 402402974
  dataset_size: 903123184.0
configs:
- config_name: default
  data_files:
  - split: pinder_s
    path: data/pinder_s-*
  - split: pinder_af2
    path: data/pinder_af2-*
  - split: pinder_xl
    path: data/pinder_xl-*
  - split: val
    path: data/val-*
---
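
The `conversions` entries in the metadata above map the selenium-containing residues MSE and SEC onto MET and CYS and rename the SE atom accordingly. As a rough illustration of what such a rule does to a single atom record, here is a minimal Python sketch. The helper names `standardize_atom` and `one_letter` are hypothetical and not part of any library, and pairing `residue_names` with `residue_types` by position is an assumption based on the two lists having the same order and length.

```python
from typing import Tuple

# Conversion rules copied from the `conversions` list in the metadata:
# source residue -> (target residue, {old atom name: new atom name}).
CONVERSIONS = {
    "MSE": ("MET", {"SE": "SD"}),
    "SEC": ("CYS", {"SE": "SG"}),
}

# Three-letter names and one-letter types, in the order given in the metadata.
RESIDUE_NAMES = [
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL", "UNK",
]
RESIDUE_TYPES = "ARNDCQEGHILKMFPSTWYVX"

# Assumption: the two lists are paired position by position.
THREE_TO_ONE = dict(zip(RESIDUE_NAMES, RESIDUE_TYPES))


def standardize_atom(res_name: str, atom_name: str) -> Tuple[str, str]:
    """Hypothetical helper: apply the conversion rules to one (residue, atom) pair."""
    if res_name in CONVERSIONS:
        target, swaps = CONVERSIONS[res_name]
        # Atoms without a swap rule keep their name.
        return target, swaps.get(atom_name, atom_name)
    return res_name, atom_name


def one_letter(res_name: str) -> str:
    """Hypothetical helper: one-letter code after conversion, X for unknown names."""
    converted, _ = standardize_atom(res_name, "CA")
    return THREE_TO_ONE.get(converted, THREE_TO_ONE["UNK"])


if __name__ == "__main__":
    print(standardize_atom("MSE", "SE"))
    print(standardize_atom("SEC", "SE"))
    print(one_letter("MSE"), one_letter("HOH"))
```

Keeping the rules in a plain dictionary mirrors the structure of the metadata, so adding another conversion would only require one more entry.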

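The dataset is organized as follows.

## Feature fields

1. **id**: string; unique identifier of the sample.
2. **cluster_id**: string; cluster identifier.
3. **pdb_id**: string; Protein Data Bank (PDB) identifier.
4. **complex**: protein atom array feature (`protein_atom_array_feature`), configured with a residue dictionary (`residue_dictionary`) and residue IDs:
   - Residue names: alanine (ALA), arginine (ARG), asparagine (ASN), aspartic acid (ASP), cysteine (CYS), glutamine (GLN), glutamic acid (GLU), glycine (GLY), histidine (HIS), isoleucine (ILE), leucine (LEU), lysine (LYS), methionine (MET), phenylalanine (PHE), proline (PRO), serine (SER), threonine (THR), tryptophan (TRP), tyrosine (TYR), valine (VAL), and unknown residue (UNK).
   - One-letter residue types, in the same order: A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, X.
   - Supported atom types (37 in total): N, CA, C, CB, O, CG, CG1, CG2, OG, OG1, SG, CD, CD1, CD2, ND1, ND2, OD1, OD2, SD, CE, CE1, CE2, CE3, NE, NE1, NE2, OE1, OE2, CH2, NH1, NH2, OH, CZ, CZ2, CZ3, NZ, OXT.
   - Residue-to-atom mapping: the atom set of each residue. For example, alanine (ALA) has N, CA, C, O, CB, and arginine (ARG) has N, CA, C, O, CB, CG, CD, NE, CZ, NH1, NH2. The mapping also defines the atoms of the non-standard residues selenocysteine (SEC) and selenomethionine (MSE).
   - Backbone atoms: the four standard protein backbone atoms N, CA, C, O.
   - Unknown residue name: UNK.
   - Residue conversions: two rules. (1) Selenomethionine (MSE) is converted to methionine (MET), with the SE atom renamed to SD. (2) Selenocysteine (SEC) is converted to cysteine (CYS), with the SE atom renamed to SG.
   - Residue IDs: enabled (`with_res_id` is true).
5. **apo_receptor**: protein atom array feature with the same configuration as `complex`; the receptor in its unbound (apo) form.
6. **apo_ligand**: protein atom array feature with the same configuration as `complex`; the ligand in its unbound (apo) form ("ligand" follows the dataset's original definition).
7. **pred_receptor**: protein atom array feature with the same configuration as `complex`; the predicted receptor structure.
8. **pred_ligand**: protein atom array feature with the same configuration as `complex`; the predicted ligand structure.
9. **receptor_uniprot_accession**: string; UniProt (Universal Protein Resource) accession of the receptor.
10. **ligand_uniprot_accession**: string; UniProt accession of the ligand.
11. **receptor_uniprot_seq**: string; UniProt reference sequence of the receptor.
12. **ligand_uniprot_seq**: string; UniProt reference sequence of the ligand.
13. **receptor_resids_with_uniprot_mapping**: sequence of unsigned 16-bit integers; the receptor residue numbers that have a UniProt sequence mapping.
14. **receptor_mapped_uniprot_resids**: sequence of unsigned 16-bit integers; the UniProt reference-sequence residue numbers that the receptor residues map to.
15. **ligand_resids_with_uniprot_mapping**: sequence of unsigned 16-bit integers; the ligand residue numbers that have a UniProt sequence mapping.
16. **ligand_mapped_uniprot_resids**: sequence of unsigned 16-bit integers; the UniProt reference-sequence residue numbers that the ligand residues map to.
17. **oligomeric_count**: unsigned 16-bit integer; oligomer count of the target complex.
18. **resolution**: half-precision float; resolution at which the complex structure was solved.
19. **probability**: half-precision float; described in this listing as a model-predicted confidence probability.
20. **method**: string; experimental method used to solve the complex structure (for example X-ray diffraction or cryo-electron microscopy).

## Dataset splits

The dataset contains four splits:

- `pinder_xl`: 416782224 bytes, 1955 examples
- `val` (validation): 395833402 bytes, 1958 examples
- `pinder_s`: 58797916 bytes, 250 examples
- `pinder_af2`: 31709642 bytes, 180 examples

The total download size is 402402974 bytes and the total dataset size is 903123184 bytes.

## Dataset configuration

The default configuration (`default`) maps each split to the following data file paths:

- `pinder_s`: `data/pinder_s-*`
- `pinder_af2`: `data/pinder_af2-*`
- `pinder_xl`: `data/pinder_xl-*`
- `val`: `data/val-*`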
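
The split figures above can be cross-checked with a few lines of arithmetic. The sketch below copies the numbers from this listing into a plain dictionary (the variable names are illustrative only) and confirms that the per-split byte counts add up to the reported total dataset size. It also prints the average size per example for each split.

```python
# Split statistics copied from the listing above.
SPLITS = {
    "pinder_xl": {"num_bytes": 416_782_224, "num_examples": 1955},
    "val": {"num_bytes": 395_833_402, "num_examples": 1958},
    "pinder_s": {"num_bytes": 58_797_916, "num_examples": 250},
    "pinder_af2": {"num_bytes": 31_709_642, "num_examples": 180},
}
DATASET_SIZE = 903_123_184   # reported total dataset size in bytes
DOWNLOAD_SIZE = 402_402_974  # reported total download size in bytes

total_bytes = sum(info["num_bytes"] for info in SPLITS.values())
total_examples = sum(info["num_examples"] for info in SPLITS.values())

# The per-split sizes should account for the whole dataset.
assert total_bytes == DATASET_SIZE

for name, info in SPLITS.items():
    mib_per_example = info["num_bytes"] / info["num_examples"] / 2**20
    print(f"{name:>10}: {info['num_examples']:>5} examples, "
          f"{mib_per_example:.2f} MiB per example on average")

print(f"total examples: {total_examples}")
print(f"download / dataset size ratio: {DOWNLOAD_SIZE / DATASET_SIZE:.2f}")
```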
Provided by:
graph-transformers