CRISPRome: A Comprehensive CRISPR-Cas Resource Database

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14975124

Resource description:

CRISPRome is a large, publicly available CRISPR-Cas system database that systematically characterizes CRISPR-Cas elements across a diverse set of bacterial and archaeal genomes. This resource serves as a foundation for studying the global distribution and characteristics of CRISPR systems in microbes.

Data Collection

To construct CRISPRome, we compiled data from multiple sources:

RefGen Database: 1,414,360 bacterial genomes and 13,258 archaeal genomes, extracted from GenBank and RefSeq.

Metagenomic Datasets: 26,565 genomes/metagenome-assembled genomes (MAGs), collected from three human body sites and three natural environments.

CRISPR-Cas System Identification

We identified CRISPR-Cas systems using CRISPRCasFinder v2.0.355, extracting both spacers and CRISPR arrays:

Spacers: 21,282,263 total spacers, of which 14,968,498 (70%) are associated with Cas proteins.

CRISPR Arrays: 3,937,928 total arrays, of which 1,429,909 (36%) are linked to Cas proteins.

Spacer-Target Identification Process

Spacer Deduplication & Alignment: Spacers were deduplicated and aligned using Blastn-short to a database of 7,042,467 genomes from RefSeq, GenBank, ICEberg, IMG/VR, and PLSDB. makeblastdb was used to convert the genome sequences into searchable BLAST databases.

Blastn Alignment & Optimization: Blastn was run with optimized parameters (word_size 11, qcov_hsp_perc 95, perc_identity 95, max_hsps 3) to balance speed and accuracy. High-performance computing clusters managed by Slurm were used for large-scale processing.

Filtering & Prioritization: Hits were sorted by bitscore, and the top 10% were kept for accuracy. Targets were prioritized in the order viruses > plasmids = ICE > other sources, reflecting the typical targets of CRISPR-Cas immunity.

Removing CRISPR Array Artifacts: Alignments overlapping known CRISPR array regions were removed to avoid false positives. Additional CRISPR arrays were identified from NCBI genome annotations to refine this filtering.

Detecting Potential CRISPR Arrays: Potential CRISPR arrays were defined by detecting regularly interspaced protospacers. A threshold of ≥4 protospacers was used to minimize false positives, leading to the exclusion of 16,777 targets.

LCA-Based One-to-One Spacer-Target Assignment: TaxonKit lca was used to determine the Lowest Common Ancestor (LCA) of multiple target matches. This ensured that each spacer was assigned to a single, biologically relevant target.
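
The filtering steps described above (bitscore cut, source prioritization, and the ≥4 regularly interspaced protospacer rule) can be illustrated with a minimal Python sketch. This is not the CRISPRome pipeline itself: the `Hit` record, the function names, the `max_gap` and `gap_tolerance` values, and the choice to apply the 10% cut to whatever list of hits is passed in are all assumptions made for illustration. Only the 10% fraction, the priority order, and the threshold of 4 come from the description.

```python
from collections import defaultdict
from dataclasses import dataclass

# Priority from the description: viruses > plasmids = ICE > other sources.
# Lower rank is better; the numeric values themselves are arbitrary.
SOURCE_RANK = {"virus": 0, "plasmid": 1, "ICE": 1, "other": 2}


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one spacer-to-target alignment."""
    spacer_id: str
    target_id: str
    source: str      # "virus", "plasmid", "ICE" or "other"
    start: int       # protospacer start on the target (0-based)
    end: int         # protospacer end on the target (exclusive)
    bitscore: float


def top_fraction_by_bitscore(hits, fraction=0.10):
    """Sort hits by bitscore and keep the best fraction (at least one hit)."""
    ranked = sorted(hits, key=lambda h: h.bitscore, reverse=True)
    if not ranked:
        return []
    return ranked[:max(1, int(len(ranked) * fraction))]


def best_source_hits(hits):
    """Keep only hits from the highest-priority source class that is present."""
    if not hits:
        return []
    def rank(h):
        return SOURCE_RANK.get(h.source, SOURCE_RANK["other"])
    best = min(rank(h) for h in hits)
    return [h for h in hits if rank(h) == best]


def array_like_targets(hits, min_protospacers=4, max_gap=60, gap_tolerance=5):
    """Return IDs of targets with a run of regularly interspaced protospacers.

    Assumption: "regular" means consecutive gaps are positive, at most
    max_gap long, and within gap_tolerance of the previous gap.
    """
    by_target = defaultdict(list)
    for h in hits:
        by_target[h.target_id].append(h)

    flagged = set()
    for target_id, group in by_target.items():
        group.sort(key=lambda h: h.start)
        run, prev_gap = 1, None          # protospacers in the current run
        for left, right in zip(group, group[1:]):
            gap = right.start - left.end
            if 0 < gap <= max_gap:
                if prev_gap is None or abs(gap - prev_gap) <= gap_tolerance:
                    run += 1
                else:
                    run = 2              # this pair starts a new run
                prev_gap = gap
            else:
                run, prev_gap = 1, None  # too far apart or overlapping
            if run >= min_protospacers:
                flagged.add(target_id)
                break
    return flagged
```

In such a sketch, hits on targets returned by `array_like_targets` would be dropped as likely CRISPR array artifacts before `top_fraction_by_bitscore` and `best_source_hits` are applied. The order of these steps and the spacing parameters would need to be checked against the actual pipeline.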

Created:

2025-03-05
