Supplemental datasets for: 'Integrating terminal-free sequence modeling and explainability to resolve 'dark matter' in algal genomics'

Source: NIAID Data Ecosystem (indexed 2026-05-02)

Download link:

https://zenodo.org/record/13920000

Resource description:

This repository contains supplementary datasets for 'Integrating terminal-free sequence modeling and explainability to resolve 'dark matter' in algal genomics'. The preprint is available at https://doi.org/10.48550/arXiv.2411.06798.

Data S1 | Neural network training sequences (4 GB). Dataset comprising DNA sequences processed in two formats: with terminal information (TI-inclusive) and without terminal information (TI-free).

Data S2 | Training and inference scripts. The LA4SR framework integrates several open-source software packages and models. We employed LoRA (Low-Rank Adaptation; https://github.com/microsoft/LoRA), PEFT (Parameter-Efficient Fine-Tuning; https://github.com/huggingface/peft), and QLoRA (Quantized Low-Rank Adaptation; https://github.com/artidoro/qlora) for parameter-efficient post-training, and used Mamba (https://github.com/state-spaces/mamba) as an alternative to transformer-based architectures. The Hugging Face Transformers library (https://github.com/huggingface/transformers) facilitated implementation, pretraining, and post-training of the open-source models. Training and inference were carried out on the High Performance Computing resources at New York University Abu Dhabi (Jubail HPC cluster), with jobs dispatched to NVIDIA (Santa Clara, CA, USA) V100, A100, or H100 nodes.

Data S3 | Interpretability scripts, including DeepMotifMinerPro (DOI: 10.5281/zenodo.13920001). Includes scripts implementing the custom explainer programs presented with this work, including Captum-, DeepLIFT-, and SHAP-based approaches, which explain how different amino acid residues and their patterns and positions affect model decisions.

Data S4 | Real-world sequencing data. To validate our approach and address real-world challenges, we applied LA4SR models to newly generated data from clean and contaminated isogenic algal cultures. We cultured and sequenced ten separate isogenic colonies of Chlamydomonas reinhardtii CC-1883. Of these, nine were sequenced with Illumina 150 bp paired-end short reads and one with Pacific Biosciences (PacBio, Menlo Park, CA, USA) HiFi reads and Dovetail (Sydney, Australia) Hi-C to generate a complete, axenic reference assembly (Fig. S3; Data S4).
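
As a rough illustration of why the LoRA-style post-training named under Data S2 is called parameter-efficient, the sketch below compares the number of trainable values in a full weight update with the number in a low-rank update of the form B·A. The layer dimensions and rank are hypothetical and are not taken from the LA4SR scripts; the function names are introduced here for illustration only.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable values when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values for a low-rank update B @ A,
    where B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8

full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, rank)
print(f"full update: {full} values")
print(f"low-rank update: {lora} values ({lora / full:.2%} of full)")
```

The count grows linearly with the rank, so the rank is the main knob trading adapter capacity against memory; the values actually used for LA4SR would have to be read from the Data S2 scripts.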

Created:

2025-02-14