cryoPANDA: A 37-million-particle dataset from over 250 experiments to accelerate data-driven cryo-EM analysis

DataCite Commons | Updated 2026-04-30 | Indexed 2026-05-05
Download link:
https://www.scidb.cn/detail?dataSetId=8a504807b5f947e58a1c57d7ff7a9658
Description:
Cryogenic Electron Microscopy (cryo-EM) is a powerful imaging technique that has revolutionized structural biology over the last decade. Cryo-EM images, also known as micrographs, contain two main components: background noise and 2D protein projections, known as particles. These particles serve as the primary input for downstream analysis, which ultimately aims to produce a 3D reconstruction of the protein and, at sufficient resolution, the protein's atomic structure. However, working with these data is challenging because of their low signal-to-noise ratio (SNR), varying defocus planes, and high variability across rotational orientations and diverse protein structures. Existing cryo-EM datasets primarily provide micrographs, with limited particle-level collections, which restricts coverage of particle-level variability across EMPIAR experiments.

To address this gap, we introduce the cryo-EM Particles ANnotated DAtaset (cryoPANDA), a large-scale cryo-EM particle dataset that is more than 10-fold larger than prior particle collections. cryoPANDA comprises over 37 million particles from 252 cryo-EM experiments spanning a wide range of protein types, totaling more than 13 TB of data. For each experiment, the dataset includes its particles, per-particle annotations, the corresponding 3D electrostatic potential map, and the published Electron Microscopy Data Bank (EMDB) map and Protein Data Bank (PDB) model, where available.

The data are arranged in separate directories, one per cryo-EM experiment. Each experiment directory contains two zipped subdirectories: particles.zip, which stores restacked particles in batches of 1,000, and particles_information.zip, which contains metadata related to restack, 2D classification, and 3D reconstruction jobs. These jobs are further organized into dedicated subdirectories and include output files from cryoSPARC, such as .png images and metadata files in .cs, .mrc, and .star formats.
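As a minimal sketch of how one might inspect a downloaded experiment directory before extracting anything, the following counts the members of the two archives by file extension. Only the archive names particles.zip and particles_information.zip come from the description above; the function names are hypothetical, and the internal layout of each archive is not assumed.

```python
import zipfile
from collections import Counter
from pathlib import Path, PurePosixPath


def summarize_archive(zip_path):
    """Count archive members by file extension without extracting them."""
    counts = Counter()
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries, count files only
            # Zip member names always use forward slashes.
            suffix = PurePosixPath(info.filename).suffix.lower() or "(none)"
            counts[suffix] += 1
    return dict(counts)


def summarize_experiment(experiment_dir):
    """Summarize the two archives named in the dataset description.

    An archive that is missing from the directory maps to None.
    """
    experiment_dir = Path(experiment_dir)
    summary = {}
    for name in ("particles.zip", "particles_information.zip"):
        path = experiment_dir / name
        summary[name] = summarize_archive(path) if path.is_file() else None
    return summary
```

Calling `summarize_experiment` on one experiment directory returns a dictionary keyed by archive name, which makes it easy to check that the expected .cs, .mrc, .star, and .png outputs are present before committing disk space to a full extraction.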
In addition to these two subdirectories, the main experiment directory contains (i) particles.star, which provides particle annotations required for importing particles into software packages such as cryoSPARC, (ii) extended per-particle annotations in .xlsx format, (iii) the reconstructed electrostatic potential map produced by cryoPANDA in .mrc format, (iv) the corresponding EMDB map in .map format, and, where available, (v) an associated PDB atomic model in .cif format.

The cryoPANDA dataset is also provided in HDF5 format, chunked by experiment into 252 separate .h5 files. Within each HDF5 file, particle images and per-particle annotations can be accessed under the keys 'particles' and 'annotations', respectively. All particle images are resampled to 224×224 pixels, stored as uint8, and normalized per image to the [0, 255] range. Combined with the smaller total size (1.8 TB) and the ease of distribution, this format significantly improves sampling efficiency during the training of machine-learning models.
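The HDF5 access pattern described above could be sketched with h5py roughly as follows. The keys 'particles' and 'annotations' and the uint8 storage come from the description; the assumption that 'particles' is a single dataset with the particle index on the first axis is ours, as are the function names. Because the description does not say whether 'annotations' is a group or a dataset, the sketch only reports its kind rather than reading it.

```python
import h5py
import numpy as np


def describe_experiment(path):
    """Return basic facts about one per-experiment .h5 file."""
    with h5py.File(path, "r") as f:
        particles = f["particles"]      # key named in the description
        annotations = f["annotations"]  # key named in the description
        return {
            # Assumption: the first axis indexes particles.
            "n_particles": particles.shape[0],
            "image_shape": tuple(particles.shape[1:]),
            "dtype": str(particles.dtype),
            # Group or Dataset: not specified, so report it.
            "annotations_kind": type(annotations).__name__,
        }


def load_batch(path, indices):
    """Read selected particle images and rescale uint8 to float32 in [0, 1]."""
    # h5py fancy indexing needs increasing indices without duplicates,
    # so the batch comes back in sorted order.
    idx = np.unique(np.asarray(indices))
    with h5py.File(path, "r") as f:
        images = f["particles"][idx]
    return images.astype(np.float32) / 255.0
```

Reading only the requested rows keeps memory use proportional to the batch size rather than to the experiment size, which is what one would want when sampling across all 252 files during training.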
Provider:
Science Data Bank
Created:
2026-04-30