Foundation-Peptidomimetics-Language-Model
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14720107
Resource description:
This dataset aims to advance the systematic study and design of peptidomimetics by leveraging non-canonical elements. We extracted over 17,000 non-canonical elements, including non-canonical amino acids and terminal modifications, from peptidomimetics available in the ChEMBL database. These elements have been standardized so that they can be represented in a sequence-like format, providing a foundation for consistent analysis and design.
We developed a foundational language model, GPepT, trained on peptides and peptidomimetics. This model, hosted on HuggingFace (https://huggingface.co/Playingyoyo/GPepT), allows users to design novel peptidomimetics efficiently. The combination of the standardized dataset and GPepT makes it easier to explore, analyze, and generate new peptidomimetic sequences with enhanced scientific precision.
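As a rough illustration of how a model hosted on HuggingFace is typically queried, the sketch below samples sequences through the generic `text-generation` pipeline of the `transformers` library. It assumes that the GPepT checkpoint loads with this standard pipeline; the function name `sample_sequences`, the prompt, and the sampling parameters are hypothetical choices rather than settings documented here, so the model card should be consulted for the recommended usage.

```python
# Minimal sketch, assuming GPepT loads with the standard transformers
# text-generation pipeline. Prompt format and sampling settings are
# illustrative assumptions, not values taken from the dataset description.
MODEL_ID = "Playingyoyo/GPepT"  # model ID from the HuggingFace URL above


def sample_sequences(prompt, n=3, max_length=20):
    """Return n sampled continuations of `prompt` as plain strings."""
    # Imported lazily so this module can be loaded without transformers
    # installed; running the function needs `transformers` and a backend
    # such as PyTorch, plus network access to download the checkpoint.
    from transformers import pipeline

    generator = pipeline("text-generation", model=MODEL_ID)
    outputs = generator(
        prompt,
        max_length=max_length,
        do_sample=True,           # sample rather than decode greedily
        num_return_sequences=n,
    )
    return [item["generated_text"] for item in outputs]
```

Calling `sample_sequences(prompt)` downloads the checkpoint on first use. Which prompt strings the model expects (for example, a start token or a partial sequence in the standardized vocabulary) is not stated in this description.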
Files Included:
dictionary.txt: A comprehensive dictionary of elements (amino acids and terminal modifications) with the following features:
standardized IDs (e.g., canonical amino acids follow the one-letter code; non-canonical amino acids start with "X", terminal modifications with "Z").
SMILES representation
tautomeric SMILES
peptide bond sites
functional groups
some physicochemical properties
frequency
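The ID convention described for dictionary.txt can be sketched as a small classifier. This is only an illustration of the stated prefix rule: the function name `classify_element` and the example IDs such as "X123" and "Z7" are hypothetical, since the exact form of the IDs after the prefix is not given here.

```python
# Minimal sketch of the stated ID convention: canonical amino acids use
# the one-letter code, non-canonical amino acids start with "X", and
# terminal modifications start with "Z".
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard one-letter codes


def classify_element(element_id: str) -> str:
    """Classify a dictionary ID by its prefix."""
    if element_id.startswith("X"):
        return "non-canonical amino acid"
    if element_id.startswith("Z"):
        return "terminal modification"
    if element_id in CANONICAL:
        return "canonical amino acid"
    return "unknown"


# Hypothetical IDs, used only to exercise the rule.
for example_id in ["A", "X123", "Z7"]:
    print(example_id, "->", classify_element(example_id))
```

The prefixes are checked first because neither "X" nor "Z" is among the 20 standard one-letter codes, so the three classes cannot collide.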
datasetP.txt: A dataset of peptidomimetics extracted from ChEMBL, encoded using the standardized vocabulary. Contains:
ChEMBL ID
sequence representation
SMILES representation
length
some physicochemical properties
peptidomimetics_wetlab.txt: Sequences (Pep1 to Pep5) generated by GPepT that were used for experimental validation.
Pep1_activity.txt: Antimicrobial activity of Pep1 against E. coli.
Created:
2025-01-22



