FAPM: Functional annotation of proteins using multi-modal models beyond structural modeling
Source: DataONE | Updated 2024-07-16 | Indexed 2024-07-27
Download link:
https://search.dataone.org/view/sha256:3109cbde82aec70d1f9ebc5c91132588916b42fc29cbb08005dc7a0af3366692
Resource description:
Assigning accurate property labels to proteins, like functional terms and catalytic activity, is challenging, especially for proteins without homologs and "tail labels" with few known examples. Unlike previous methods that mainly focused on protein sequence features, we use a pretrained large natural language model to understand the semantic meaning of protein labels. Specifically, we introduce FAPM, a contrastive multi-modal model that links natural language with protein sequence language. This model combines a pretrained protein sequence model with a pretrained large language model to generate labels, such as Gene Ontology (GO) functional terms and catalytic activity predictions, in natural language. Our results show that FAPM excels in understanding protein properties, outperforming models based solely on protein sequences or structures. It achieves state-of-the-art performance on public benchmarks and in-house experimentally annotated phage proteins, which often have few known homol...

# FAPM: Functional annotation of proteins using multi-modal models beyond structural modeling
[https://doi.org/10.5061/dryad.m905qfv9p](https://doi.org/10.5061/dryad.m905qfv9p)
The online demo is at: [https://huggingface.co/spaces/wenkai/FAPM_demo](https://huggingface.co/spaces/wenkai/FAPM_demo)
### Description of the data and file structure
The dataset includes:
1. Gene Ontology (GO) information. GO is a system for describing the functions of proteins.
- The basic version of the GO (file name: go1.4-basic.obo). Source: [https://geneontology.org/docs/download-ontology/](https://geneontology.org/docs/download-ontology/)
- The mapping between GO numbers and GO descriptions (file name: go_descriptions1.4.txt)
- GO terms (file names: bp_terms.pkl; mf_terms.pkl; cc_terms.pkl)
2. Manually annotated data derived from the UniProt database. These datasets are used to fine-tune the model.
- File names:
train_exp_prompt_bp.csv; train_exp_prompt_mf.csv; train_exp_prompt_cc.cs...
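
As a rough illustration of how the ontology file listed above might be read, the sketch below collects the `id`, `name`, and `namespace` fields of each `[Term]` stanza. It assumes go1.4-basic.obo follows the standard OBO flat-file layout; the function name `parse_obo_terms` is hypothetical and is not part of the FAPM code. The layout of the .pkl and .csv files is not described here, so they are not covered.

```python
from typing import Dict, Iterable


def parse_obo_terms(lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Map each GO id to its id, name, and namespace (hypothetical helper).

    Assumes standard OBO layout: stanzas start with a bracketed header
    such as [Term] and hold one "key: value" pair per line.
    """
    wanted = ("id", "name", "namespace")
    terms: Dict[str, Dict[str, str]] = {}
    current = None  # fields of the [Term] stanza being read, if any

    def flush() -> None:
        # Store the finished stanza if it carried an id.
        if current is not None and "id" in current:
            terms[current["id"]] = current

    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            flush()
            # Only [Term] stanzas are kept; [Typedef] and others are skipped.
            current = {} if line == "[Term]" else None
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key in wanted:
                current.setdefault(key, value)  # keep the first occurrence
    flush()
    return terms


# Small in-memory sample in OBO layout, used instead of the real file.
sample = """format-version: 1.4

[Term]
id: GO:0008150
name: biological_process
namespace: biological_process

[Typedef]
id: part_of
name: part of
"""

terms = parse_obo_terms(sample.splitlines())
print(terms["GO:0008150"]["name"])
```

Because the function accepts any iterable of lines, an open text file handle for go1.4-basic.obo could be passed in place of the sample, again assuming the file uses the standard layout.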
Created:
2024-07-17



