CRISPRome: A Comprehensive CRISPR-Cas Resource Database
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14975124
Resource description:
CRISPRome is a large, publicly available database of CRISPR-Cas systems that systematically characterizes CRISPR-Cas elements across a diverse set of bacterial and archaeal genomes. It serves as a foundation for studying the global distribution and characteristics of CRISPR systems in microbes.
Data Collection
To construct CRISPRome, we compiled data from multiple sources:
RefGen Database:
1,414,360 bacterial genomes
13,258 archaeal genomes
Extracted from GenBank and RefSeq
Metagenomic Datasets:
26,565 genomes/metagenome-assembled genomes (MAGs)
Collected from three human body sites and three natural environments
CRISPR-Cas System Identification
We identified CRISPR-Cas systems using CRISPRCasFinder v2.0.355, extracting both spacers and CRISPR arrays:
Spacers:
21,282,263 total spacers
14,968,498 (70%) associated with Cas proteins
CRISPR Arrays:
3,937,928 total arrays
1,429,909 (36%) linked to Cas proteins
Spacer-Target Identification Process
Spacer Deduplication & Alignment
Spacers were deduplicated, then aligned with BLASTN (blastn-short task) against a database of 7,042,467 genomes from RefSeq, GenBank, ICEberg, IMG/VR, and PLSDB.
makeblastdb was used to convert genome sequences into searchable databases.
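The deduplication step can be sketched as follows. This is a minimal illustration, assuming spacers arrive as (identifier, sequence) pairs and that duplicates are exact sequence matches after upper-casing. The record does not say whether reverse complements were also collapsed, so that behavior sits behind an optional flag. All function and variable names are hypothetical.

```python
# Translation table for DNA complements (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def deduplicate_spacers(spacers, collapse_revcomp=False):
    """Group spacer identifiers by unique sequence.

    spacers: iterable of (spacer_id, sequence) pairs.
    collapse_revcomp: assumption, not stated in the source; when True,
        a sequence and its reverse complement share one key.
    Returns a dict mapping each unique sequence to the list of spacer
    identifiers that carry it, in first-seen order.
    """
    members = {}
    for spacer_id, seq in spacers:
        key = seq.upper()
        if collapse_revcomp:
            # Use the lexicographically smaller strand as the canonical key.
            key = min(key, reverse_complement(key))
        members.setdefault(key, []).append(spacer_id)
    return members
```

Only the keys of the returned dictionary would need to be aligned. The identifier lists let each alignment result be mapped back to every original spacer.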
Blastn Alignment & Optimization
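BLASTN was run with optimized parameters (word_size 11, qcov_hsp_perc 95, perc_identity 95, max_hsps 3) to balance speed and accuracy: a word size of 11, at least 95% query coverage per HSP, at least 95% identity, and at most 3 HSPs per subject sequence.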
High-performance computing clusters managed by Slurm were used for large-scale processing.
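As a rough sketch, the two BLAST+ invocations described above could be assembled like this. The alignment parameters are the ones listed in the record. The file names, the tabular output format (-outfmt 6), and the thread count are assumptions added for illustration, and the helper names are hypothetical.

```python
def makeblastdb_command(fasta_path, db_prefix):
    """Build the makeblastdb call that indexes a nucleotide FASTA file."""
    return [
        "makeblastdb",
        "-in", fasta_path,
        "-dbtype", "nucl",
        "-out", db_prefix,
    ]


def blastn_short_command(query_fasta, db_prefix, out_path, threads=1):
    """Build a blastn call using the parameters stated in the record."""
    return [
        "blastn",
        "-task", "blastn-short",
        "-query", query_fasta,
        "-db", db_prefix,
        "-word_size", "11",
        "-qcov_hsp_perc", "95",
        "-perc_identity", "95",
        "-max_hsps", "3",
        "-outfmt", "6",                 # assumption: tabular output
        "-num_threads", str(threads),   # assumption: not given in the source
        "-out", out_path,
    ]
```

Passing either list to subprocess.run(cmd, check=True) would execute it, provided BLAST+ is installed. On a Slurm cluster, the same command would typically be wrapped in a job script, one job per database chunk.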
Filtering & Prioritization
Hits were sorted by bit score, and only the top 10% were kept to favor high-confidence matches.
Prioritization order: viruses > plasmids = ICEs (integrative and conjugative elements) > other sources, reflecting the expected targets of CRISPR-Cas immunity.
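One way to read the filtering and prioritization rules is sketched below. Several choices are assumptions, because the record does not specify them: the top 10% is taken per spacer, the cutoff is rounded up so that at least one hit survives, and hits are dictionaries with hypothetical keys "spacer", "source", and "bitscore".

```python
from collections import defaultdict

# Lower rank means higher priority; plasmids and ICEs are tied.
PRIORITY = {"virus": 0, "plasmid": 1, "ICE": 1}
OTHER_RANK = 2


def top_percent_by_bitscore(hits, percent=10):
    """Keep the best `percent` of hits by bit score (rounded up, minimum 1)."""
    ranked = sorted(hits, key=lambda h: h["bitscore"], reverse=True)
    keep = max(1, (len(ranked) * percent + 99) // 100)  # integer ceiling
    return ranked[:keep]


def prioritized_hits(hits, percent=10):
    """For each spacer, keep top-scoring hits from the best source class."""
    by_spacer = defaultdict(list)
    for hit in hits:
        by_spacer[hit["spacer"]].append(hit)

    result = {}
    for spacer, group in by_spacer.items():
        kept = top_percent_by_bitscore(group, percent)
        best = min(PRIORITY.get(h["source"], OTHER_RANK) for h in kept)
        result[spacer] = [
            h for h in kept if PRIORITY.get(h["source"], OTHER_RANK) == best
        ]
    return result
```

Because plasmids and ICEs share a rank, a spacer whose best hits include both keeps both. A later step has to resolve such ties to a single target.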
Removing CRISPR Array Artifacts
Alignments overlapping known CRISPR array regions were removed to avoid false positives.
Additional CRISPR arrays were identified from NCBI genome annotations to refine filtering.
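The overlap filter can be illustrated with a simple interval check. This is a sketch under stated assumptions: coordinates are 1-based and inclusive, as in BLAST tabular output; array regions are given as (contig, start, end) tuples; and any overlap at all disqualifies a hit. The record does not state the exact overlap criterion, and the key names "contig", "sstart", and "send" are hypothetical.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def remove_array_hits(hits, arrays):
    """Drop hits whose subject interval overlaps a known CRISPR array.

    hits: list of dicts with "contig", "sstart", "send".
    arrays: iterable of (contig, start, end) tuples.
    """
    by_contig = defaultdict(list)
    for contig, start, end in arrays:
        by_contig[contig].append((min(start, end), max(start, end)))

    kept = []
    for hit in hits:
        # BLAST reports sstart > send for minus-strand hits, so normalize.
        lo, hi = sorted((hit["sstart"], hit["send"]))
        regions = by_contig.get(hit["contig"], ())
        if any(overlaps(lo, hi, s, e) for s, e in regions):
            continue
        kept.append(hit)
    return kept
```

A linear scan per contig is enough for illustration. At the scale described here, a sorted-interval or interval-tree lookup would be the more practical choice.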
Detecting Potential CRISPR Arrays
Potential CRISPR arrays in target sequences were defined by detecting regularly interspaced protospacers.
A threshold of ≥4 protospacers was used to minimize false positives, leading to the exclusion of 16,777 targets.
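A possible form of the interspacing check is sketched below. Only the threshold of at least 4 protospacers comes from the record. The maximum gap between neighboring hits (max_gap) and the allowed deviation between gaps (tolerance) are hypothetical parameters chosen for illustration, and the input is assumed to be the start positions of protospacer hits on a single target sequence.

```python
def find_putative_arrays(starts, min_hits=4, max_gap=100, tolerance=10):
    """Return runs of regularly spaced hit positions with >= min_hits members.

    A run grows while each new gap is at most max_gap and within
    `tolerance` of the first gap in the run.
    """
    positions = sorted(set(starts))
    runs = []
    run = positions[:1]
    ref_gap = None
    for prev, cur in zip(positions, positions[1:]):
        gap = cur - prev
        regular = gap <= max_gap and (
            ref_gap is None or abs(gap - ref_gap) <= tolerance
        )
        if regular:
            run.append(cur)
            if ref_gap is None:
                ref_gap = gap
            continue
        if len(run) >= min_hits:
            runs.append(run)
        if gap <= max_gap:
            # The spacing changed but the hits are still close:
            # start a new run from the previous hit.
            run, ref_gap = [prev, cur], gap
        else:
            run, ref_gap = [cur], None
    if len(run) >= min_hits:
        runs.append(run)
    return runs
```

A target with at least one such run would be treated as a likely unannotated CRISPR array rather than a genuine protospacer-bearing target.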
LCA-Based One-to-One Spacer-Target Assignment
TaxonKit lca was used to determine the Lowest Common Ancestor (LCA) of multiple target matches.
This ensured each spacer was assigned to a single, biologically relevant target.
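To illustrate what the LCA step computes, here is a plain-Python sketch. It does not call TaxonKit. It assumes each target's lineage is already available as a list of taxon labels ordered from root to leaf, and the labels in the usage note are toy values rather than real taxonomy.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    lineages: non-empty list of root-to-leaf lists of taxon labels.
    """
    if not lineages:
        raise ValueError("at least one lineage is required")
    common = None
    # zip stops at the shortest lineage, so no index can run past the end.
    for level in zip(*lineages):
        if all(taxon == level[0] for taxon in level):
            common = level[0]
        else:
            break
    return common
```

For example, with the toy lineages ["root", "Viruses", "clade_A", "phage_1"] and ["root", "Viruses", "clade_A", "phage_2"], the function returns "clade_A". A spacer matching both phages is then assigned to that shared clade instead of to an arbitrary one of the two.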
Created:
2025-03-05



