
Foldclass databases for protein structural domains in CATH and TED

Source: rdr.ucl.ac.uk. Updated: 2024-12-04. Indexed: 2025-01-22.
Download link:
https://rdr.ucl.ac.uk/articles/dataset/Foldclass_databases_for_protein_structural_domains_in_CATH_and_TED/26348605/1
Resource description:
This repository contains databases of protein domains for use with Foldclass and Merizo-search. We provide databases for all 365 million domains in TED, as well as all classified domains in CATH 4.3.

Foldclass and Merizo-search use two database formats. The default format stores the data as a PyTorch tensor and a pickled list of Python tuples. This format is used for the CATH database, which is small enough to fit in memory. For larger-than-memory datasets, such as TED, we use a binary format that is searched using the Faiss library.

The CATH database requires approximately 1.4 GB of disk space, whereas the TED database requires about 885 GB. Please ensure you have enough free storage space before downloading. For best search performance with the TED database, store it on the fastest storage hardware available to you.

IMPORTANT: We recommend going into each folder and downloading the files individually. If you download a whole folder in one go, you will receive a zip file that must then be decompressed. This is particularly an issue for the TED database, because you will need roughly twice as much storage space as when downloading the individual files. Our GitHub repository (see Related Materials below) contains a convenience script to download each database; we recommend using that.
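The storage advice above can be checked before starting a download. The following is a minimal sketch, not part of Foldclass or Merizo-search: the function names and the `DATABASE_SIZE_GB` table are hypothetical, the sizes are the approximate figures quoted in the description, and treating 1 GB as 10^9 bytes is an assumption. The factor of two for zip downloads reflects the note that a zipped folder needs roughly twice the space.

```python
import shutil

# Approximate sizes quoted in the description (assumption: 1 GB = 10**9 bytes).
DATABASE_SIZE_GB = {"cath": 1.4, "ted": 885.0}


def required_bytes(database: str, as_zip: bool = False) -> int:
    """Estimate the free space needed to download a database.

    Downloading a whole folder as a zip needs room for both the archive
    and the extracted files, so the estimate is doubled in that case.
    """
    size = DATABASE_SIZE_GB[database] * 1e9
    factor = 2 if as_zip else 1
    return round(size * factor)


def has_enough_space(path: str, database: str, as_zip: bool = False) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes(database, as_zip)


if __name__ == "__main__":
    for name in DATABASE_SIZE_GB:
        ok = has_enough_space(".", name)
        print(f"{name}: {'enough space' if ok else 'not enough space'} in current directory")
```

Calling `has_enough_space(target_dir, "ted", as_zip=True)` gives the more conservative check for the case where a folder is downloaded as a single zip file.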

Provider:
University College London