
Dataset used to train ProteinCLIP

Mendeley Data | Updated 2024-06-27 | Indexed 2024-06-27
Download link:
https://zenodo.org/records/11176863
Description:
This dataset contains embeddings for the UniProt records used to train ProteinCLIP. It includes both embeddings from protein language models and natural language embeddings of protein function. All records are stored in HDF5 files with UniProt identifiers as keys (e.g., P38398) and embedding arrays as their values. All embeddings are given as 1-dimensional vectors. The following protein language models have associated embeddings: ESM2 (6-layer, 12-layer, 30-layer, 33-layer, and 36-layer variants) and ProtT5. Some of these protein embeddings are split into sub-files indicated by the suffix "_splitN"; concatenating these files yields the full set of proteins. They were created separately because the embeddings were computed in parallel. Text embeddings are generated by OpenAI's "text-embedding-3-large" model. The raw files used to create these embeddings are also included: the ".dat.gz" file contains the archive of UniProt annotations, including the function fields parsed to create the function text embeddings, and the ".fasta.gz" file contains the corresponding sequences.
Created:
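The layout described above (identifiers as keys, 1-dimensional arrays as values, optional "_splitN" sub-files) can be read with a few lines of Python. The following is a minimal sketch, not the authors' loading code. It assumes the `h5py` and `numpy` packages are installed and that each identifier is stored as a top-level dataset in the file; the function names `load_embeddings` and `load_split_embeddings` are hypothetical and introduced only for illustration.

```python
from typing import Dict, Iterable

import h5py
import numpy as np


def load_embeddings(path: str) -> Dict[str, np.ndarray]:
    """Read one HDF5 file into a dict of identifier -> embedding vector.

    Assumption: every key at the top level of the file is a dataset
    holding a single 1-dimensional embedding.
    """
    with h5py.File(path, "r") as handle:
        # Slicing with [:] copies the dataset into an in-memory NumPy array,
        # so the values stay usable after the file is closed.
        return {key: handle[key][:] for key in handle.keys()}


def load_split_embeddings(paths: Iterable[str]) -> Dict[str, np.ndarray]:
    """Merge several "_splitN" files into one identifier -> vector dict.

    If an identifier appears in more than one file, the value from the
    later file wins.
    """
    merged: Dict[str, np.ndarray] = {}
    for path in paths:
        merged.update(load_embeddings(path))
    return merged
```

To merge a set of split files, pass their paths in any order, for example `load_split_embeddings(sorted(glob.glob("*_split*.hdf5")))` after `import glob`. The glob pattern here is a guess at the naming scheme, so check it against the actual file names in the archive.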
2024-05-15