Data related to Panzer: A Machine Learning Based Approach to Analyze Supersecondary Structures of Proteins

Name: Data related to Panzer: A Machine Learning Based Approach to Analyze Supersecondary Structures of Proteins
Creator: DaRUS
Published: 2024-11-27 00:00:00
License: No description available

Source: doi.org | Updated: 2024-11-27 | Indexed: 2025-01-21

Download link:

https://doi.org/10.18419/darus-4576

Description:

This entry contains the data used in the bachelor thesis, which investigated how embeddings can be used to analyze supersecondary structures.

Abstract of the thesis: This thesis analyzes the behavior of supersecondary structures in the context of embeddings. For this purpose, data from the Protein Topology Graph Library (PTGL) was enriched with embeddings. This resulted in a structured graph database, which will be used for future work and analyses. In addition, several projections into two-dimensional space were made to analyze how the embeddings behave there.

The Jupyter Notebook 1_data_retrival.ipynb contains the download process for the graph files from the Protein Topology Graph Library (https://ptgl.uni-frankfurt.de). The downloaded .gml files are also provided in graph_files.zip. These graphs represent the relationships between supersecondary structures in the proteins and form the data basis for the further analyses.

The graph files are then processed in the Jupyter Notebook 2_data_storage_and_embeddings.ipynb and entered into a graph database. The sequences of the supersecondary and secondary structures from the PTGL can be found in fastas.zip. The embeddings were calculated using the ESM model of the Facebook Research group (huggingface.co/facebook/esm2_t12_35M_UR50D) and are provided in three .h5 files; they are subsequently added to the database. The whole process in this notebook serves to build up the database, which can then be searched using Cypher queries.

In the Jupyter Notebook 3_data_science.ipynb, different visualizations and analyses are carried out with the help of UMAP.

To install all dependencies, it is recommended to create a Conda environment and install all packages there. To use the project, PyEED should be installed from the snapshot of the original repository (source repository: https://github.com/PyEED/pyeed). The best way to install PyEED is to execute the pip install -e . command in the pyeed_BT folder. The dependencies can also be installed using poetry and the .toml file. In addition, seaborn, h5py and umap-learn are required. These can be installed using the following commands:

    pip install h5py==3.12.1
    pip install seaborn==0.13.2 umap-learn==0.5.7
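
For readers who want to inspect the sequences in fastas.zip without running the notebooks, the following is a minimal sketch that uses only the Python standard library. It is not taken from the thesis code. The internal layout of the archive is not described in this entry, so the file suffixes (.fasta, .fa), the UTF-8 encoding, and the helper names parse_fasta and read_fastas_from_zip are all assumptions made for illustration.

```python
import io
import zipfile


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the previous one first.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def read_fastas_from_zip(source, suffixes=(".fasta", ".fa")):
    """Collect FASTA records from every matching member of a zip archive.

    `source` may be a path or a binary file-like object. The suffix filter
    is an assumption; adjust it to the actual member names in the archive.
    Records with identical headers overwrite each other.
    """
    sequences = {}
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(suffixes):
                text = archive.read(name).decode("utf-8")
                sequences.update(parse_fasta(text))
    return sequences


if __name__ == "__main__":
    # Build a tiny in-memory archive so the sketch runs without the dataset.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "demo/example.fasta",
            ">seq1 helix\nMKT\nAYI\n>seq2 strand\nGGS\n",
        )
    buffer.seek(0)
    for header, sequence in read_fastas_from_zip(buffer).items():
        print(header, len(sequence))
```

With the real dataset, the call would presumably be read_fastas_from_zip("fastas.zip") once the archive has been downloaded. If the headers are not unique across files, keying the result by (member name, header) instead would avoid silent overwrites.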


Provider:

DaRUS
