Main dataset for the Large-scale analysis of the β-lactamase sequence space with protein language models

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14743324

Description:

The main dataset for the publication "Large-scale analysis of the β-lactamase sequence space with protein language models". The dataset contains 29,445 rows and 82 columns and is provided in Parquet format. Each row is one sequence retrieved from the Beta-Lactamase DataBase (BLDB). The columns contain information processed from the BLDB: the taxonomy of each sequence annotated against the Genome Taxonomy Database (GTDB RS207); per-protein embeddings derived from five protein language models (ESM-1b, ESM2-650, ESM2-3b, CARP-640M, ProtTrans-t5-xl-u50); functional annotations estimated with Biopython; the sequence quality filters applied to select sequences for the analysis; annotations from the AlphaFold Database (AFDB) for sequences with available structures; and secondary structure annotations generated with pyDSSP from the structures predicted by AlphaFold2. The 2-dimensional PCA, t-SNE, and UMAP representations for the evaluated protein language models are provided as separate files in CSV and Parquet formats. The beginning of each filename indicates the algorithm used and the set of beta-lactamases covered: sbl for serine beta-lactamases and mbl for metallo-beta-lactamases. For more information, consult the GitHub repository: https://github.com/miangoar/Betalactamase-analysis-with-machine-learning
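The following is a minimal Python sketch of how the files described above might be handled, not code from the dataset authors. The record does not spell out the exact naming scheme, so `describe_projection_file` assumes that the family code (sbl or mbl) and the projection method appear as underscore-separated tokens in the filename; the helper names and the example filenames are hypothetical. `load_main_dataset` simply wraps `pandas.read_parquet`, which requires pandas plus a Parquet engine such as pyarrow.

```python
from pathlib import Path

# Family prefixes named in the dataset description.
FAMILIES = {"sbl": "serine beta-lactamases", "mbl": "metallo-beta-lactamases"}
# Projection methods named in the description, lowercased and without hyphens.
METHODS = ("pca", "tsne", "umap")


def describe_projection_file(filename):
    """Guess the family and projection method from a projection filename.

    Assumption: the family code and the method are underscore-separated
    tokens in the file stem. The real naming scheme may differ.
    """
    stem = Path(filename).stem.lower().replace("-", "")
    tokens = stem.split("_")
    family = next((FAMILIES[t] for t in tokens if t in FAMILIES), None)
    method = next((m for m in METHODS if m in tokens), None)
    return family, method


def load_main_dataset(path, columns=None):
    """Read the main Parquet table into a pandas DataFrame.

    `path` is whatever local file was downloaded from the Zenodo record;
    `columns` optionally restricts the read to a subset of the 82 columns.
    """
    import pandas as pd  # imported lazily so the filename helper works without pandas

    return pd.read_parquet(path, columns=columns)
```

If the downloaded file matches the description, the DataFrame returned by `load_main_dataset` should have 29,445 rows and 82 columns, which is a quick sanity check to run on `df.shape` before further analysis.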

Created:

2025-01-26