CryoCRAB: A Large-scale Curated and Filterable Dataset for Cryo-EM Foundation Model Pre-training
Source: DataCite Commons. Updated: 2025-04-27. Indexed: 2025-04-16.
Download link:
https://www.scidb.cn/detail?dataSetId=14986b67bc4b4816b777c8d378925dc7
Description:
Cryo-electron microscopy (cryo-EM) is a revolutionary biological imaging technique that enables near-atomic-resolution three-dimensional reconstruction of proteins in a state close to their native form, and it plays a crucial role in structural biology and drug development. However, the extremely low signal-to-noise ratio of cryo-EM images and the complexity of the data processing hinder efficient and accurate high-resolution structure determination, and conventional methods often fall short of the requirements for high-quality reconstruction. In recent years, foundation models have shown remarkable potential in other biological imaging fields, particularly medical imaging. These models use self-supervised learning to extract general-purpose features from large-scale unlabeled data, which supports zero-shot and few-shot tasks. In cryo-EM, however, the application of foundation models has been limited by the lack of large-scale, high-quality, standardized datasets. To address this bottleneck, we present CryoCRAB, the first large-scale dataset specifically designed for training cryo-EM foundation models. CryoCRAB covers 746 protein types and comprises 152,385 processed micrographs. Because cryo-EM images are very noisy, CryoCRAB splits the frames of each movie into odd and even subsets and generates a pair of micrographs from them, providing diverse paired data for training denoising models.
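The odd/even split described above can be illustrated with a minimal sketch. It assumes a movie is already loaded as a NumPy array of shape (n_frames, height, width) and simply sums the alternating frames. The function name `split_odd_even` is hypothetical, and the actual CryoCRAB pipeline (which also records motion and CTF parameters) may differ in its details.

```python
from __future__ import annotations

import numpy as np


def split_odd_even(movie: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Sum the odd- and even-numbered frames of a movie into two micrographs.

    Assumption: `movie` has shape (n_frames, height, width) and its frames
    are already aligned. Frames are counted from 1, so the "odd" half is
    frames 1, 3, 5, ... (array indices 0, 2, 4, ...).
    """
    if movie.ndim != 3:
        raise ValueError("expected an array of shape (n_frames, height, width)")
    if movie.shape[0] < 2:
        raise ValueError("need at least two frames to form a pair")
    odd = movie[0::2].sum(axis=0)   # frames 1, 3, 5, ...
    even = movie[1::2].sum(axis=0)  # frames 2, 4, 6, ...
    return odd, even


if __name__ == "__main__":
    # Synthetic stand-in for a movie: 8 noisy frames of 64 x 64 pixels.
    rng = np.random.default_rng(0)
    movie = rng.normal(size=(8, 64, 64))
    odd, even = split_odd_even(movie)
    print(odd.shape, even.shape)
```

Both halves image the same specimen but carry independent noise, which is what makes such pairs usable as input and target for noise-to-noise style denoiser training.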
The data is stored in HDF5 chunked format, which significantly improves random sampling efficiency and model training speed compared to traditional storage methods.

The dataset consists of a root directory containing 746 subdirectories, each representing a sub-dataset named after its corresponding EMPIAR ID. The root directory also contains all_metadata.json, a centralized metadata file for all sub-datasets. For each sub-dataset, it records information such as the original movie paths, gain paths, motion parameters, and CTF parameters.

Each subdirectory (representing a sub-dataset) contains the following four files:

empair_*.json: The metadata file for the specific sub-dataset, mirroring the content of the corresponding entry in all_metadata.json.
micrograph.tar.gz: A compressed archive containing 200 full-diff micrograph pairs in MRC format for the sub-dataset.
micrograph_h5.tar.gz: A compressed archive containing 200 full-diff micrograph pairs in H5 format for the sub-dataset.
background.tar.gz: A compressed archive containing the estimated background for 200 micrographs in the sub-dataset.
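As a rough sketch of how the layout above might be indexed after download, the following uses only the Python standard library. The file names come from the description; the function names `index_subdatasets` and `load_json` are hypothetical, and the internal schema of the JSON files is not specified here, so the code treats their content as opaque.

```python
import json
from pathlib import Path

# Archive names as listed in the dataset description.
EXPECTED_ARCHIVES = ("micrograph.tar.gz", "micrograph_h5.tar.gz", "background.tar.gz")


def index_subdatasets(root):
    """Map each sub-dataset directory name to the files found inside it.

    Assumption: every directory directly under `root` is one sub-dataset.
    """
    root = Path(root)
    index = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        meta_files = sorted(sub.glob("empair_*.json"))
        index[sub.name] = {
            # Per-sub-dataset metadata file, or None if it is absent.
            "metadata": meta_files[0] if meta_files else None,
            # Archives that are present on disk.
            "archives": {n: sub / n for n in EXPECTED_ARCHIVES if (sub / n).exists()},
            # Archives that the description lists but that were not found.
            "missing": [n for n in EXPECTED_ARCHIVES if not (sub / n).exists()],
        }
    return index


def load_json(path):
    """Load a metadata file; its schema is treated as unknown here."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

A typical use would be to call `index_subdatasets` on the root directory, check the `missing` lists for incomplete downloads, and then extract only the archives needed (for example with the standard `tarfile` module).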
Provider:
Science Data Bank
Created:
2024-12-25



