Learning the Rules of Peptide Self-assembly through Data Mining with Large Language Models

Indexed by NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/14787834
Resource description:
Peptides are biologically ubiquitous and important molecules that self-assemble into diverse structures. While extensive research has explored the effects of chemical composition and environmental conditions on self-assembly, a systematic study that consolidates these data to uncover global rules is lacking. In this work, we curate a peptide assembly database through a combination of manual processing by human experts and literature mining with a large language model. As a result, we collect more than 1,000 experimental data entries with information about peptide sequence, experimental conditions, and the corresponding self-assembly phases. Using these data, we train and evaluate machine learning models, which classify assembly phases efficiently and with accuracy above 80%. Moreover, we fine-tune our GPT model for peptide literature mining with the developed dataset; the fine-tuned model performs markedly better than the pre-trained model at extracting information from academic publications. By guiding experimental work, this workflow can make the search for potential self-assembling peptide candidates more efficient, while also deepening our understanding of the mechanisms governing peptide self-assembly.

Files:
--- phase_data_clean.csv: more than 1,000 peptide self-assembly data entries under different experimental conditions.
--- mined_paper_list.csv: the papers from which the data were collected.
--- trainset.jsonl and testset.jsonl: the data used for fine-tuning the LLM.
--- fine-tuning.ipynb: code used to fine-tune the ChatGPT model.
--- pretrain.ipynb: code used to test the pretrained ChatGPT model.
--- train_and_inference.ipynb: code that uses the mined data to train and test an ML predictor for phase classification.
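The description names trainset.jsonl and testset.jsonl but does not document their fields. As a minimal sketch for inspecting such a file before use, the Python below parses JSON Lines input and counts the top-level keys it finds. The helper names `read_jsonl` and `key_counts` and the `prompt`/`completion` fields in the sample are hypothetical placeholders, not the dataset's actual schema.

```python
import json
from collections import Counter
from io import StringIO


def read_jsonl(handle):
    """Parse one JSON object per non-empty line of a JSON Lines stream."""
    return [json.loads(line) for line in handle if line.strip()]


def key_counts(records):
    """Count how often each top-level key appears across the records."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


# Hypothetical stand-in for trainset.jsonl: the real field names are not
# given in the dataset description, so these two records are made up.
sample = StringIO(
    '{"prompt": "a", "completion": "b"}\n'
    "\n"
    '{"prompt": "c", "completion": "d"}\n'
)

records = read_jsonl(sample)
schema = key_counts(records)
print(len(records), dict(schema))
```

With the downloaded files, the same helpers could be applied by replacing `sample` with `open("trainset.jsonl", encoding="utf-8")`; the printed key counts then show which fields the records actually contain.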
Created: 2025-03-22