HoarfrostLab/UniprotAndSwissprotDatasets
Source: Hugging Face | Updated: 2025-01-05 | Indexed: 2025-04-26
Download link:
https://hf-mirror.com/datasets/HoarfrostLab/UniprotAndSwissprotDatasets
Description:
This dataset is derived from the UniProt and SwissProt databases and is intended for analyzing biological sequences. It includes raw sequences, processed files, and benchmark datasets that were split using hierarchical strategies to ensure diversity and reduce model bias. The source data comes from the TrEMBL and SwissProt sections of UniProt and was processed through filtering, mapping, and cleaning steps. The training, validation, and test sets are derived from UniRef50, UniRef90, or UniRef100 clusters. The datasets are organized into balanced and unbalanced splits, with additional variants in which full-length sequences are chunked into shorter sequences. There are also four benchmark datasets covering different data-balancing scenarios. The chunked datasets are particularly useful for sequence alignment and similarity-based tasks.
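The description mentions two ideas that a short sketch may make more concrete: splitting by cluster membership, and chunking full-length sequences. The Python below is only an illustration under stated assumptions, not the dataset authors' pipeline. The function names, the `(seq_id, cluster_id, sequence)` record layout, the hash-based assignment, the split fractions, and the fixed chunk length are all hypothetical and do not come from the dataset.

```python
import hashlib


def assign_split(cluster_id, val_frac=0.1, test_frac=0.1):
    """Map a cluster ID (e.g. a UniRef50 ID) to a split name, deterministically.

    Hypothetical scheme: hash the cluster ID to a number in [0, 1) so that
    every sequence in the same cluster lands in the same split.
    """
    digest = hashlib.sha256(cluster_id.encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "validation"
    return "train"


def split_by_cluster(records, val_frac=0.1, test_frac=0.1):
    """Group (seq_id, cluster_id, sequence) records into three splits."""
    splits = {"train": [], "validation": [], "test": []}
    for seq_id, cluster_id, sequence in records:
        name = assign_split(cluster_id, val_frac, test_frac)
        splits[name].append((seq_id, cluster_id, sequence))
    return splits


def chunk_sequence(sequence, chunk_len):
    """Cut a sequence into consecutive, non-overlapping chunks.

    The last chunk may be shorter than chunk_len.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    return [sequence[i:i + chunk_len] for i in range(0, len(sequence), chunk_len)]


# Toy records with made-up IDs; real records would come from the dataset files.
records = [
    ("seqA", "cluster_1", "MKTAYIAKQR"),
    ("seqB", "cluster_1", "MKTAYIAKQL"),
    ("seqC", "cluster_2", "GSHMLEDPVA"),
]
splits = split_by_cluster(records)
chunks = chunk_sequence("MKTAYIAKQR", 4)
```

Assigning whole clusters to a split, rather than individual sequences, is one common way to keep near-identical sequences from appearing on both sides of a train/test boundary. Whether the dataset uses this exact scheme is not stated in the description.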
Provided by:
HoarfrostLab



