
FAPM: Functional annotation of proteins using multi-modal models beyond structural modeling

Source: DataONE. Updated: 2024-07-16. Indexed: 2024-07-27.
Download link:
https://search.dataone.org/view/sha256:3109cbde82aec70d1f9ebc5c91132588916b42fc29cbb08005dc7a0af3366692
Description:
Assigning accurate property labels to proteins, such as functional terms and catalytic activity, is challenging, especially for proteins without homologs and for "tail labels" with few known examples. Unlike previous methods that mainly focused on protein sequence features, we use a pretrained large natural language model to understand the semantic meaning of protein labels. Specifically, we introduce FAPM, a contrastive multi-modal model that links natural language with protein sequence language. This model combines a pretrained protein sequence model with a pretrained large language model to generate labels, such as Gene Ontology (GO) functional terms and catalytic activity predictions, in natural language. Our results show that FAPM excels in understanding protein properties, outperforming models based solely on protein sequences or structures. It achieves state-of-the-art performance on public benchmarks and in-house experimentally annotated phage proteins, which often have few known homol...

[https://doi.org/10.5061/dryad.m905qfv9p](https://doi.org/10.5061/dryad.m905qfv9p)

The online demo is at: [https://huggingface.co/spaces/wenkai/FAPM_demo](https://huggingface.co/spaces/wenkai/FAPM_demo)

### Description of the data and file structure

The dataset includes:

1. Gene Ontology (GO) information. GO is a system for describing the functions of proteins.
   - The basic version of GO (file name: go1.4-basic.obo). Source: [https://geneontology.org/docs/download-ontology/](https://geneontology.org/docs/download-ontology/)
   - The mapping between GO numbers and GO descriptions (file name: go_descriptions1.4.txt)
   - GO terms (file names: bp_terms.pkl; mf_terms.pkl; cc_terms.pkl)
2. Manually annotated data derived from the UniProt database. These datasets are used to fine-tune the model.
   - File names: train_exp_prompt_bp.csv; train_exp_prompt_mf.csv; train_exp_prompt_cc.cs...
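The `go1.4-basic.obo` file named in the description is distributed in the OBO flat-file format, where each GO term is a `[Term]` stanza with `id:` and `name:` tags. A minimal Python sketch of reading such a file into an ID-to-name mapping might look like the following. The function name `parse_obo_terms` and the inline `SAMPLE` text are illustrative assumptions and are not part of the dataset; the layouts of the `.txt`, `.pkl`, and `.csv` files are not described here, so they are not covered.

```python
from typing import Dict, Iterable, Optional


def parse_obo_terms(lines: Iterable[str]) -> Dict[str, str]:
    """Map each term ID to its name, reading only [Term] stanzas.

    Hypothetical helper for illustration; not part of the FAPM release.
    """
    terms: Dict[str, str] = {}
    in_term = False
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current_id = None
        elif in_term and line.startswith("id: "):
            current_id = line[len("id: "):]
        elif in_term and line.startswith("name: ") and current_id is not None:
            terms[current_id] = line[len("name: "):]
    return terms


# Small made-up excerpt in OBO syntax, used instead of the real file.
SAMPLE = """\
format-version: 1.2

[Term]
id: GO:0008150
name: biological_process
namespace: biological_process

[Term]
id: GO:0003674
name: molecular_function
namespace: molecular_function

[Typedef]
id: part_of
name: part of
"""

go_names = parse_obo_terms(SAMPLE.splitlines())
print(go_names)
```

With the downloaded file, the same function could be called on an open file handle, for example `with open("go1.4-basic.obo") as fh: go_names = parse_obo_terms(fh)`, assuming the file sits in the working directory.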
Created: 2024-07-17