CROP: A feature-independent context-aware method for CRISPR-Cas9 frameshift prediction: Preprocessed Δlength datasets and test splits.

Figshare, updated 2026-01-13, indexed 2026-04-28

Download link:

https://figshare.com/articles/dataset/CROP_A_feature-independent_context-conditioned/30998056

Description:

All datasets and test sets used in the study "CROP: A feature-independent context-aware method for CRISPR-Cas9 frameshift prediction." The data include the 19 primary datasets used for CROP's training and testing, as well as the supplementary datasets referenced throughout the paper.

The datasets come from the following studies:

Predicting the mutations generated by repair of Cas9-induced double-strand breaks (FORECasT)
Predictable and precise template-free CRISPR editing of pathogenic variants (inDelphi)
Large dataset enables prediction of repair after CRISPR–Cas9 editing in primary T cells (SPROUT)
Deep sampling of gRNA in the human genome and deep-learning-informed prediction of gRNA activities (Aldit)
X-CRISP: domain-adaptable and interpretable CRISPR repair outcome prediction (X-CRISP preprocessing of inDelphi + FORECasT)

Each dataset has a "sequence" column holding the target sequence, a "PAM position" column, and several columns holding unnormalized numerical data for each Δlength. Any additional columns were used for preliminary testing and should be disregarded during primary analysis.

Please observe the following guidelines regarding model selection and training:

FORECasT: Do not use the original FORECasT dataset and the X-CRISP version of FORECasT together during training, because the two versions do not correlate well enough with each other.
SPROUT: We advise against using both versions of SPROUT. The standard version is recommended over the CROTON version, as the latter is easier to predict due to the removal of mixed events and long insertions prior to Δlength calculation.

Datasets originating from the same study frequently share a high proportion of target sequences. Account for this overlap when partitioning data to prevent information leakage between training and evaluation sets.

We provide the two test splits used to evaluate CROP's performance.

The first, detailed in Section 3.1 ("CROP outperforms state-of-the-art models in frameshift prediction"), benchmarks CROP against existing methods. In this configuration, the SPROUT dataset (both variants) is excluded entirely from training to allow a direct comparison against the other methods, which were not trained on SPROUT but were tested on its CROTON version.

The second split supports the final model analysis presented in Figure 4A and the interpretability results. For this split, we trained on all datasets described in this study except the CROTON version of SPROUT (the standard version of SPROUT was used), and evaluated performance across all test sets.
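For readers who build their own partitions rather than using the provided test splits, the following is a minimal sketch of two steps implied by the description: turning the unnormalized Δlength columns of one row into a frameshift fraction, and assigning rows to train or test by target sequence so that a sequence shared between datasets never lands on both sides. It is not the authors' code. The Δlength column names (here "-1", "0", "3"), the helper names `frameshift_fraction` and `assign_split`, and the 10% test fraction are assumptions for illustration; only the "sequence" and "PAM position" column names come from the description. A frameshift is taken here to mean a Δlength that is not a multiple of 3.

```python
import hashlib


def frameshift_fraction(row, delta_columns):
    """Fraction of the row's total signal that falls on frameshifting Δlengths.

    row           -- mapping of column name to value (e.g. from csv.DictReader)
    delta_columns -- mapping of column name to its integer Δlength; the actual
                     column names in the files are not assumed here, so the
                     caller supplies them (hypothetical naming).
    Returns None when the row has no signal at all.
    """
    counts = {delta: float(row[col]) for col, delta in delta_columns.items()}
    total = sum(counts.values())
    if total == 0:
        return None
    # A Δlength that is not a multiple of 3 shifts the reading frame.
    shifted = sum(v for delta, v in counts.items() if delta % 3 != 0)
    return shifted / total


def assign_split(sequence, test_fraction=0.1):
    """Deterministically map a target sequence to "train" or "test".

    Hashing the sequence means identical sequences from different datasets
    always receive the same label. This only catches exact matches
    (case-insensitive), not near-duplicates.
    """
    digest = hashlib.sha256(sequence.upper().encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


if __name__ == "__main__":
    # Toy row with hypothetical Δlength column names and made-up values.
    example = {"sequence": "ACGTACGTAC", "PAM position": "7",
               "-1": "2", "0": "1", "3": "1"}
    columns = {"-1": -1, "0": 0, "3": 3}
    print(frameshift_fraction(example, columns))
    print(assign_split(example["sequence"]))
```

Hashing the sequence, rather than shuffling rows, keeps the assignment stable across files and runs, which matters when several datasets from the same study are loaded separately.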

Created:

2026-01-13