chandar-lab/UR100P

Name: chandar-lab/UR100P
Creator: chandar-lab
Published: 2024-10-16 18:31:25
License: No description available

Source: Hugging Face
Updated: 2024-10-16
Indexed: 2024-12-14

Download link:

https://hf-mirror.com/datasets/chandar-lab/UR100P

Description:

This dataset contains curated protein sequences from multiple sources and has been used to train and evaluate the efficient state-of-the-art protein language model AMPLIFY. It combines data from UniProt, the Observed Antibody Space (OAS), and the Structural Classification of Proteins version 2 (SCOP 2) databases to enable task-specific validation of the models. All data were collected in December 2023, and sequences containing ambiguous amino acids (B, J, O, U, X, Z) were removed. For OAS, only paired heavy and light chain sequences were included, and the dataset was augmented by incorporating sequences in both heavy|light (Hc|Lc) and light|heavy (Lc|Hc) chain arrangements, separated by a chainbreak token `|`. MMseqs2 was used to filter out sequences in the train sets with >90% sequence identity to the validation sets, preventing data leakage and ensuring a fair evaluation.
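The two preprocessing steps described above, dropping sequences with ambiguous amino acids and augmenting paired antibody chains in both orders, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`has_ambiguous_residue`, `filter_unambiguous`, `augment_paired`) and the toy sequences are hypothetical, and only the ambiguous-residue set (B, J, O, U, X, Z) and the `|` chainbreak token come from the description.

```python
# Illustrative sketch only; names and example sequences are assumptions,
# not taken from the AMPLIFY codebase.

# Residue codes the description lists as ambiguous.
AMBIGUOUS = frozenset("BJOUXZ")


def has_ambiguous_residue(seq: str) -> bool:
    """Return True if the sequence contains any of B, J, O, U, X, Z."""
    return any(res in AMBIGUOUS for res in seq.upper())


def filter_unambiguous(seqs):
    """Keep only sequences free of ambiguous residues."""
    return [s for s in seqs if not has_ambiguous_residue(s)]


def augment_paired(heavy: str, light: str, sep: str = "|"):
    """Return both chain arrangements (Hc|Lc and Lc|Hc) joined by the chainbreak token."""
    return [f"{heavy}{sep}{light}", f"{light}{sep}{heavy}"]


if __name__ == "__main__":
    print(filter_unambiguous(["MKTAYIAK", "MKXAYIAK", "ACDU"]))
    print(augment_paired("EVQLV", "DIQMT"))
```

The identity-based train/validation filtering with MMseqs2 is a separate step and is not reproduced here.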

Provided by:

chandar-lab
