UltraScan Solution Modeler (US-SOMO) hydrodynamic parameter, structural small angle scattering and SESCA circular dichroism (CD) calculations on AlphaFold predicted structures

Source: DataONE. Updated: 2023-01-14. Indexed: 2024-06-08.
Download link:
https://search.dataone.org/view/sha256:4035d7f4fef6cd4f16246395d4359f16c77137960acbbda90585eeffe85bee85
Description:
Recent spectacular advances by AI programs in 3D structure predictions from protein sequences have revolutionized the field in terms of accuracy and speed. The resulting "folding frenzy" has already produced predicted protein structure databases for the entire human and other organisms' proteomes. However, rapidly ascertaining a predicted structure's reliability based on measured properties in solution should be considered. Shape-sensitive hydrodynamic parameters such as the diffusion and sedimentation coefficients (D0t(20,w), s0(20,w)) and the intrinsic viscosity ([η]) can provide a rapid assessment of the overall structure's likelihood, and SAXS would yield the structure-related pair-wise distance distribution function p(r) vs. r. Using the extensively validated UltraScan SOlution MOdeler (US-SOMO) suite, a database was implemented calculating from AlphaFold structures the corresponding D0t(20,w), s0(20,w), [η], p(r) vs. r, and other parameters. Circular dichroism spectra were computed u...

Production of this dataset required three major steps: collect the AlphaFold entries and additional metadata; prepare the structures for hydrodynamic, structural and CD calculations; and compute the hydrodynamic, structural and CD properties. Briefly, each entry in the entire AlphaFold database was first compared with the corresponding entry in the UniProt database to find the (putative) initiator methionine, signal peptide and transit peptide regions, which were subsequently removed from the AlphaFold PDB files. Additional variants were created when propeptides were found. Potential disulfides were identified (subsequently allowing a better evaluation of the partial specific volume and of M) and written as SSBOND records in the cured PDBs, together with HELIX and SHEET information identified using the DSSP implementation in UCSF Chimera (Pettersen et al., 2004, Journal of Computational Chemistry, 25(13), pp. 1605-1612). Batch-mode US-SOMO was then used to calculate the mass M, The translat...

This is a tar archive of all datasets for each AlphaFold entry. It includes a CSV file containing all hydrodynamic parameters, a PDB file containing the cured structure, an mmCIF file containing the cured structure, a data file containing the circular dichroism spectrum, and a p(r) vs. r dat file. Use "tar xf somoaf_all_data.tar" to extract the primary archive. This will result in 1,002,038 individual .txz files, each representing one UniProt accession code and containing 5 files. When propeptides are identified and removed, the extracted file name will contain a -pp#, where # is a list of the propeptides removed. For example, to extract the data from an individual txz file, use "tar Jxf xxxx.txz", where xxxx is replaced by the appropriate name containing the accession code. Further details are in the provided README.md file.
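The two extraction commands above can also be scripted. The following is a minimal sketch using Python's standard `tarfile` module, assuming you only want a single entry rather than all of the .txz files. The function name `extract_entry` and its arguments are illustrative and not part of the dataset; only the archive name `somoaf_all_data.tar` and the two-level tar/txz layout come from the description. The exact member names inside the primary archive are not given here, so check the README.md or list the archive first.

```python
import tarfile
from pathlib import Path


def extract_entry(primary_tar, member_name, dest_dir):
    """Pull one .txz member out of the primary tar and unpack it.

    primary_tar : path to the primary archive (e.g. somoaf_all_data.tar)
    member_name : name of one .txz member inside it (hypothetical here;
                  look up the real name in the archive listing)
    dest_dir    : directory that receives the extracted files

    Returns the sorted names of the files unpacked from the .txz.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Use the safer "data" extraction filter where this Python version has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}

    # Step 1: like "tar xf somoaf_all_data.tar", but for one member only.
    with tarfile.open(primary_tar, "r") as outer:
        outer.extract(outer.getmember(member_name), path=dest, **kwargs)

    # Step 2: like "tar Jxf xxxx.txz" (xz-compressed tar).
    out_dir = dest / Path(member_name).stem
    with tarfile.open(dest / member_name, "r:xz") as inner:
        inner.extractall(path=out_dir, **kwargs)

    return sorted(p.name for p in out_dir.rglob("*") if p.is_file())
```

A call such as `extract_entry("somoaf_all_data.tar", "xxxx.txz", "out")`, with `xxxx.txz` replaced by a real member name, would leave the unpacked files under `out/xxxx/`. Extracting a single member avoids writing all of the per-accession archives to disk, although `tarfile` still has to scan the primary archive to locate the member.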

Created:
2025-07-17