ProteinLMDataset
Source: arXiv | Updated: 2024-06-09 | Indexed: 2024-06-12
Download link:
https://huggingface.co/datasets/tsynbio/ProteinLMBench
Description:
ProteinLMDataset is a large-scale dataset of protein sequences and text created by the Shanghai AI Laboratory. It is designed to improve how well large language models (LLMs) understand protein sequences, through self-supervised learning and supervised fine-tuning. The dataset contains 17.46 billion tokens for self-supervised learning and 893,000 instructions for supervised fine-tuning. It covers pairs of protein sequences and English text, as well as pairs of Chinese and English scientific text, which is intended to give the dataset diversity and broad applicability. The correspondence between protein sequences and text was built by integrating several bioinformatics resources, including UniProtKB and PubMed. Its main application areas are protein science and bioengineering, where it targets complex problems in protein sequence understanding and analysis.
Provided by:
Shanghai AI Laboratory
Created:
2024-06-09
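
The download link above points to a Hugging Face dataset repository. The following is a minimal sketch of one way to inspect it from Python, assuming the `datasets` library is installed and the repository is reachable. The helper names (`repo_id_from_url`, `list_configs`, `main`) are hypothetical and not part of the dataset's own tooling, and the configuration names are discovered at runtime rather than assumed here.

```python
from urllib.parse import urlparse

# URL taken from the download link in this listing.
DATASET_URL = "https://huggingface.co/datasets/tsynbio/ProteinLMBench"


def repo_id_from_url(url: str) -> str:
    """Extract the '<owner>/<name>' repo id from a Hugging Face dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return "/".join(parts[1:3])


def list_configs(repo_id: str) -> list:
    """Return the configuration (subset) names the repository exposes."""
    # Imported lazily so the URL helper works without `datasets` installed.
    from datasets import get_dataset_config_names

    return get_dataset_config_names(repo_id)


def main() -> None:
    # Requires network access; prints the repo id and each configuration name.
    repo_id = repo_id_from_url(DATASET_URL)
    print(repo_id)
    for name in list_configs(repo_id):
        print(name)
```

Calling `main()` lists the available configurations. One of those names can then be passed as the second argument to `datasets.load_dataset` to load that subset.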



