esm2_uniref_pretraining_data

Name: esm2_uniref_pretraining_data
Creator: maas
Published: 2025-12-04 09:19:27
License: No description available

ModelScope community; updated 2025-12-04; indexed 2025-10-04

Download link:

https://modelscope.cn/datasets/nv-community/esm2_uniref_pretraining_data


Resource description:

# ESM-2 Uniref Pretraining Data

## Dataset Description:

UniRef, or UniProt Reference Clusters, are databases of clustered protein sequences from the UniProt Knowledgebase (UniProtKB) that group similar sequences to reduce redundancy and make data easier to work with for biological research. It offers different levels of clustering (UniRef100, UniRef90, and UniRef50) based on sequence identity, with each cluster containing a representative sequence, a count of member proteins, and links to detailed functional annotations in the UniProtKB.

We are releasing a subset of UniRef (UniRef50 + UniRef90) that was used for pretraining ESM-2nv models, with the following modifications. We removed the artificial sequences from UniRef50 and UniRef90 and created our own training and validation sets. We further performed MMseqs (Many-against-Many sequence searching) clustering on these datasets.

This dataset is ready for commercial/non-commercial use.

## Dataset Owner(s):

The UniRef dataset is owned and maintained by the UniProt Consortium, a collaboration between three major bioinformatics institutes: European Bioinformatics Institute (EBI), SIB Swiss Institute of Bioinformatics, and Protein Information Resource (PIR).

## Dataset Creation Date:

July 19, 2024.

## License/Terms of Use:

Governing Terms: This dataset is licensed under the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/legalcode.en) (CC BY 4.0).

## Intended Usage:

ESM2-nv is using this dataset for model pretraining. This dataset can be used by protein designers, structural biologists, bioengineers, computational biologists and protein engineers for pretraining other similar models.

## Dataset Characterization

**Data Collection Method**

* Human

**Labeling Method**

* N/A

## Dataset Format

The dataset is provided in the standard FASTA format, with one entry for each representative protein sequence from the UniRef90 and UniRef50 clusters. Each entry consists of a header line and the protein sequence itself.

## Dataset Quantification

* 187,382,018 training sequences, chosen from UniRef90 representative sequences.
* 328,360 validation sequences, chosen from UniRef50 representative sequences.

The total data storage is approximately 35GB.

## Reference(s):

1. [Uniprot Reference Clusters (UniRef)](https://www.uniprot.org/uniref)
2. [ESM-2nv 650M](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/esm2nv650m)
3. [ESM-2nv 3B](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/esm2nv3b)
4. Original ESM-2 Paper: Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., & Rives, A. (2023). Evolutionary-scale prediction of atomic level protein structure with a language model. *Science*, *379*(6637), eade2574. [https://doi.org/10.1126/science.ade2574](https://doi.org/10.1126/science.ade2574)

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
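
As a minimal sketch of reading the FASTA layout described under Dataset Format (a header line followed by the protein sequence), the Python below streams records one at a time instead of loading a whole file into memory. The function names `read_fasta` and `count_sequences`, and any file path passed to them, are assumptions made for illustration; the dataset card does not specify file names or the contents of the header lines.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    The header is returned without the leading '>'. Sequences that are
    wrapped over several lines are joined into one string.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


def count_sequences(path: str) -> Tuple[int, int]:
    """Return (number of records, total residues) for a FASTA file.

    `path` is a hypothetical local file name chosen by the caller.
    """
    n_records = 0
    n_residues = 0
    with open(path, "r", encoding="utf-8") as handle:
        for _, sequence in read_fasta(handle):
            n_records += 1
            n_residues += len(sequence)
    return n_records, n_residues
```

Run on a downloaded training or validation file, `count_sequences` gives a record count that can be checked against the figures listed under Dataset Quantification.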


Provider: maas
Created: 2025-09-26
