Enformer Celltyping training positions and cell types
Source: Mendeley Data. Updated: 2024-01-31. Indexed: 2024-06-27.
Download link:
https://figshare.com/articles/dataset/Enformer_Celltyping_training_positions_and_cell_types/22040393
Description:
The training positions and cell types for the Enformer Celltyping model. More on this here.

In detail: when training Enformer Celltyping, we used the following approach to identify training positions:

1. Bin the genome based on the predictive window.
2. Filter the bins to select the training set, based on DNA and cell type filters.
   * DNA filters:
     1. Leave a buffer at the start and end of each chromosome large enough for the DNA and local chromatin accessibility windows.
     2. Exclude blacklist regions.
   * Cell type filters:
     1. Require coverage for the histone mark > 12.5% of the returned window, to prioritise training on regions with peaks.
3. Downsample the resulting regions to equal the lowest count of regions for any histone mark, so that each histone mark has equal representation. This avoids biasing the model's training towards one mark.

This results in 67,007 training and validation positions (cell type and genomic region combinations) and 14,188 unique genomic positions, which is similar to the number of positions Basenji and Enformer were trained on (14,533). The approach ensures the model sees peaks for all histone marks. The validation set positions are randomly shifted by up to a quarter of the predictive window so that the model's performance doesn't overfit to the initial genomic bins.
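The selection procedure described above can be sketched roughly as follows. This is a minimal illustration only, not the code used to build the dataset: it handles a single chromosome and a single cell type, and every function name, parameter, and toy coordinate is hypothetical. It also assumes half-open intervals, non-overlapping peaks, and a shift that may go in either direction, none of which the description specifies.

```python
import random


def bin_chromosome(chrom_len, window, buffer):
    """Start coordinates of non-overlapping bins, leaving `buffer` bp free at each end."""
    return list(range(buffer, chrom_len - buffer - window + 1, window))


def overlaps(start, end, regions):
    """True if the half-open interval [start, end) overlaps any region."""
    return any(start < r_end and r_start < end for r_start, r_end in regions)


def coverage_fraction(start, end, peaks):
    """Fraction of [start, end) covered by peaks (peaks assumed non-overlapping)."""
    covered = sum(
        max(0, min(end, p_end) - max(start, p_start)) for p_start, p_end in peaks
    )
    return covered / (end - start)


def select_positions(chrom_len, window, buffer, blacklist, peaks_by_mark,
                     min_cov=0.125, seed=0):
    """Bin, filter, and downsample so every histone mark keeps the same number of bins."""
    rng = random.Random(seed)
    # DNA filters: end-of-chromosome buffer (via binning) and blacklist exclusion.
    bins = [
        s for s in bin_chromosome(chrom_len, window, buffer)
        if not overlaps(s, s + window, blacklist)
    ]
    # Cell type filter: peak coverage of the window must exceed min_cov (12.5%).
    kept = {
        mark: [s for s in bins if coverage_fraction(s, s + window, peaks) > min_cov]
        for mark, peaks in peaks_by_mark.items()
    }
    # Downsample every mark to the smallest per-mark count.
    n = min(len(starts) for starts in kept.values())
    return {mark: sorted(rng.sample(starts, n)) for mark, starts in kept.items()}


def shift_position(start, window, rng):
    """Randomly shift a start by up to a quarter of the window (direction is an assumption)."""
    max_shift = window // 4
    return start + rng.randint(-max_shift, max_shift)


# Toy inputs (invented for illustration): a 1,000 bp chromosome with 100 bp windows.
toy_peaks = {
    "h3k27ac": [(100, 150), (300, 320), (400, 410)],
    "h3k4me1": [(500, 600)],
}
toy_selection = select_positions(
    chrom_len=1000, window=100, buffer=100,
    blacklist=[(250, 260)], peaks_by_mark=toy_peaks,
)
```

In the toy example, the bin starting at 400 is dropped for `h3k27ac` because only 10% of it is covered by a peak, and both marks end up with one bin each because `h3k4me1` has the fewest qualifying bins.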
Created:
2024-01-31



