Learning the Rules of Peptide Self-assembly through Data Mining with Large Language Models

Source: NIAID Data Ecosystem. Indexed: 2026-05-02

Download link:

https://zenodo.org/record/14787834

Description:

Peptides are biologically ubiquitous and important molecules that self-assemble into diverse structures. While extensive research has explored the effects of chemical composition and environmental conditions on self-assembly, a systematic study consolidating this data to uncover global rules is lacking. In this work, we curate a peptide assembly database through a combination of manual processing by human experts and literature mining with a large language model. As a result, we collect more than 1,000 experimental data entries with information about peptide sequence, experimental conditions and corresponding self-assembly phases. Using this data, machine learning models are trained and evaluated, demonstrating excellent accuracy (> 80%) and efficiency in assembly phase classification. Moreover, we fine-tune our GPT model for peptide literature mining with the developed dataset; the fine-tuned model performs markedly better than the pre-trained model at extracting information from academic publications. This workflow can improve efficiency when exploring potential self-assembling peptide candidates by guiding experimental work, while also deepening our understanding of the mechanisms governing peptide self-assembly.

--- phase_data_clean.csv: 1,000+ peptide self-assembly data entries collected under different experimental conditions.
--- mined_paper_list.csv: the papers from which the data were collected.
--- trainset.jsonl and testset.jsonl: the data used for fine-tuning the LLM.
--- fine-tuning.ipynb: code used to fine-tune the ChatGPT model.
--- pretrain.ipynb: code used to test the pretrained ChatGPT model.
--- train_and_inference.ipynb: code that uses the mined data to train and test an ML predictor for phase classification.
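The phase classification step described above can be illustrated with a minimal sketch. It is not the authors' predictor from train_and_inference.ipynb: the column names (sequence, ph, phase), the toy rows, the placeholder labels phase_A and phase_B, and the choice of amino-acid composition features with a 1-nearest-neighbor rule are all assumptions made for illustration. The actual schema of phase_data_clean.csv is not described here.

```python
import csv
import io
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Toy rows with HYPOTHETICAL column names and made-up values; they are
# not taken from phase_data_clean.csv, whose schema is not given here.
TOY_CSV = """sequence,ph,phase
FFFF,7.0,phase_A
FFFY,7.0,phase_A
KKKK,3.0,phase_B
KKKE,3.0,phase_B
"""


def featurize(sequence, ph):
    """Return amino-acid fractions plus a scaled pH as one feature vector."""
    counts = Counter(sequence.upper())
    length = max(len(sequence), 1)  # guard against empty sequences
    composition = [counts[aa] / length for aa in AMINO_ACIDS]
    return composition + [ph / 14.0]


def load_rows(text):
    """Parse CSV text into (feature_vector, label) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(featurize(r["sequence"], float(r["ph"])), r["phase"]) for r in reader]


def predict_1nn(train, features):
    """Label a sample with the phase of its nearest training example."""
    nearest = min(train, key=lambda item: math.dist(item[0], features))
    return nearest[1]


train = load_rows(TOY_CSV)
prediction = predict_1nn(train, featurize("FFFW", 7.0))
print(prediction)
```

To try this on the real file, the toy CSV text would be replaced with the contents of phase_data_clean.csv and the column names adjusted to match its header. A held-out split would then be needed to estimate accuracy.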

Created:

2025-03-22
