Chemosensory evolution in the aquatic predator Drosophila enhydrobia
Source: DataCite Commons. Updated 2026-05-05; indexed 2026-05-07.
Download link:
https://zenodo.org/doi/10.5281/zenodo.18831768
Resource description:
This repository contains data, bioinformatic pipelines, and statistical visualization tools for the identification, evolutionary analysis, and comparative genomics of gene families (optimized for insect chemosensory receptors). They were used in the publication "Chemosensory evolution in the aquatic predator Drosophila enhydrobia".
Included Pipelines & Tools
blast_exonerate_pipeline.sh: Bash wrapper for the initial homology-based identification and annotation of target gene families. The pipeline uses BLAST to identify genomic regions matching protein queries, extracts the candidate regions, and refines them with Exonerate.
Software dependencies (must be installed and available in $PATH): BLAST+ (for makeblastdb and tblastn), SAMtools (for faidx), BEDTools, Exonerate, and AWK (standard on most Linux systems).
Required inputs: 1) Genome assemblies directory (a folder containing genome FASTA files). 2) Protein query file (a FASTA file containing the protein sequences to search for).
Configuration: Before running the script, edit the following variables: input_folder="path/to/genome_fastas" and query="path/to/protein_queries.fasta".
Output: For each genome, the pipeline generates the extracted candidate regions and Exonerate annotation files.
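The step between the BLAST search and the region extraction can be sketched in Python. This is a minimal illustration, not the code in blast_exonerate_pipeline.sh: it assumes the tblastn hits are in standard 12-column tabular format (-outfmt 6), and the function name hits_to_regions and the flank size are hypothetical. It pads each hit, merges overlapping hits on the same scaffold, and returns BED-style intervals of the kind that could then be extracted and passed to Exonerate.

```python
from collections import defaultdict


def hits_to_regions(blast_lines, flank=1000):
    """Turn tabular BLAST hits into merged, padded (scaffold, start, end) regions.

    Assumes 12-column tabular output: column 2 is the subject (scaffold) ID,
    columns 9 and 10 are the 1-based subject start and end. `flank` is an
    illustrative padding value, not a setting taken from the pipeline.
    """
    per_scaffold = defaultdict(list)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        scaffold = cols[1]
        sstart, send = int(cols[8]), int(cols[9])
        # tblastn reports sstart > send for hits on the minus strand.
        lo, hi = min(sstart, send), max(sstart, send)
        # BED-style interval: 0-based start, exclusive end, clipped at 0.
        per_scaffold[scaffold].append((max(0, lo - 1 - flank), hi + flank))

    regions = []
    for scaffold in sorted(per_scaffold):
        merged = []
        for start, end in sorted(per_scaffold[scaffold]):
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        regions.extend((scaffold, s, e) for s, e in merged)
    return regions


example_hits = [
    "Or1\tscaf1\t80.0\t100\t20\t0\t1\t100\t5000\t5300\t1e-30\t120",
    "Or1\tscaf1\t75.0\t90\t22\t0\t101\t190\t6200\t5900\t1e-20\t100",
    "Or2\tscaf2\t90.0\t50\t5\t0\t1\t50\t200\t350\t1e-10\t60",
]
print(hits_to_regions(example_hits, flank=500))
```

The sketch does not clip the interval end to the scaffold length; a real run would need the scaffold sizes (for example from the .fai index written by samtools faidx) to do that.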
InsectOR_pipeline.sh: Bash wrapper for scoring and selecting candidate gene models after homology-based annotation. The script iterates through the annotation files generated in the previous steps and runs the Perl script scoreGenesOnScaffold.pl to evaluate candidate genes on genomic scaffolds.
Software dependencies (must be installed and available in $PATH): Bash, Perl, and the Perl script scoreGenesOnScaffold.pl (https://github.com/sdk15/insectOR).
Required inputs: 1) A directory containing the paired files produced by the previous pipeline steps, *_GFF.txt and *_parsed_output.fasta. 2) A query FASTA file containing the reference protein sequences.
Configuration: Before running the script, edit the following variables: input_folder="path/to/folder/containing_GFF_and_parsed_fasta_files", perl_script="path/to/scoreGenesOnScaffold.pl", and query="path/to/query_proteins.fasta".
Run: bash InsectOR_pipeline.sh
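Because the scoring step depends on each *_GFF.txt file having a matching *_parsed_output.fasta file, it can help to check the pairing before launching the wrapper. The following is a hedged sketch: it assumes the two files of a pair share the same prefix, and the function name pair_annotation_files is hypothetical. It does not call scoreGenesOnScaffold.pl, whose arguments are not described here.

```python
from pathlib import Path


def pair_annotation_files(input_folder):
    """Match each *_GFF.txt file to its *_parsed_output.fasta partner.

    Assumes both files of a pair share one prefix (e.g. Dvir_GFF.txt and
    Dvir_parsed_output.fasta). Returns (pairs, unpaired_gff_files).
    """
    folder = Path(input_folder)
    pairs, unpaired = [], []
    for gff in sorted(folder.glob("*_GFF.txt")):
        prefix = gff.name[: -len("_GFF.txt")]
        fasta = folder / f"{prefix}_parsed_output.fasta"
        if fasta.is_file():
            pairs.append((prefix, gff, fasta))
        else:
            unpaired.append(gff)
    return pairs, unpaired
```

Calling pair_annotation_files("path/to/folder/containing_GFF_and_parsed_fasta_files") before the run lists any GFF file that would otherwise be processed without its FASTA partner.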
AliBaSeq_pipeline.sh: A wrapper for the ALiBaSeq framework. It automates BLAST database creation, performs reciprocal searches, and extracts homologous sequences across multiple genome assemblies.
Software dependencies (must be available in the system $PATH): Python 3, BLAST+ (for makeblastdb and blastn), and ALiBaSeq (https://github.com/ssuvorov/ALiBaSeq).
Required inputs: 1) Reference genome: a FASTA file used as the reference for ortholog detection. 2) Assemblies directory: a directory containing the genome assemblies to be searched, in FASTA format (.fa or .fasta). 3) Baits file: a FASTA file containing the loci to be recovered (e.g., exons, CDS, or probe sequences).
Run: Make the script executable with chmod +x AliBaSeq_pipeline.sh, then run ./AliBaSeq_pipeline.sh
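The idea behind a reciprocal search can be illustrated with a short Python sketch. This is a generic illustration of reciprocal filtering, not ALiBaSeq's own algorithm or data model: the function name reciprocal_filter and the three dictionaries are hypothetical. A forward hit (bait against assembly) is kept only if searching the hit contig back against the reference lands on the locus the bait came from.

```python
def reciprocal_filter(forward_hits, reverse_best, bait_locus):
    """Keep forward hits whose reverse search points back to the bait's locus.

    forward_hits: {bait_name: [(contig, bitscore), ...]}  bait vs. assembly
    reverse_best: {contig: reference_locus}               contig vs. reference
    bait_locus:   {bait_name: reference_locus}            where each bait came from
    Returns {bait_name: (contig, bitscore)} with the best confirmed hit per bait.
    """
    kept = {}
    for bait, hits in forward_hits.items():
        confirmed = [
            (contig, score)
            for contig, score in hits
            if reverse_best.get(contig) == bait_locus[bait]
        ]
        if confirmed:
            kept[bait] = max(confirmed, key=lambda hit: hit[1])
    return kept


forward = {"exon1": [("ctgA", 200.0), ("ctgB", 250.0)]}
reverse = {"ctgA": "locus1", "ctgB": "locus9"}
origin = {"exon1": "locus1"}
print(reciprocal_filter(forward, reverse, origin))
```

In this toy input, ctgB has the higher forward score but is rejected because its reverse search points to a different locus, which is the situation a reciprocal check is meant to catch (for example, a paralog).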
HYPHY_Pipeline.py: An end-to-end selection analysis pipeline. It automates tree reconstruction (IQ-TREE), tree rooting (Newick Utilities), and multiple HyPhy selection tests (RELAX, aBSREL, and BUSTED). It parses the raw JSON results into formatted CSV summaries and generates an integrated volcano plot of selection intensity.
Software dependencies (must be installed and available in $PATH): Python 3, HyPhy, IQ-TREE, and Newick Utilities. Required Python libraries: pandas, matplotlib, seaborn.
Required inputs: 1) A directory containing coding sequence alignments in FASTA format (one file per gene). 2) A foreground species list (a text file specifying the taxa used as foreground branches for the RELAX analyses).
Configuration: Before running the script, edit the following variables at the top of the script: IQTREE_MODEL, IQTREE_SEED, FOREGROUND_DEFAULT, and BANNED.
Run: python HYPHY_Pipeline.py <input_folder> [--outdir results] [--workers N] [--foreground PREFIX]
Output, per gene: maximum-likelihood phylogenetic trees (IQ-TREE); rooted trees (Newick Utilities); HyPhy result files (RELAX, aBSREL, BUSTED) in JSON format. Output, overall: parsed CSV summaries of the HyPhy results and a volcano plot summarizing selection intensity across genes.
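To show how a RELAX result could feed a volcano plot, the sketch below turns one JSON result into a pair of plot coordinates. It rests on assumptions that should be checked against the actual files: that the RELAX JSON stores its summary under "test results" with the keys "p-value" and "relaxation or intensification parameter" (the K parameter), and that the plot uses log2(K) on the x-axis and -log10(p) on the y-axis. The function name relax_volcano_point and the 0.05 threshold are hypothetical, and HYPHY_Pipeline.py may compute its axes differently.

```python
import json
import math


def relax_volcano_point(relax_json_text, alpha=0.05):
    """Convert one RELAX JSON result into (x, y, label) for a volcano plot.

    x = log2(K): negative means relaxed, positive means intensified selection.
    y = -log10(p-value) of the RELAX likelihood-ratio test.
    The JSON key names and alpha are assumptions, not taken from the pipeline.
    """
    results = json.loads(relax_json_text)["test results"]
    k = results["relaxation or intensification parameter"]
    p = results["p-value"]
    x = math.log2(k)
    # Floor p to avoid log10(0) when HyPhy reports a p-value of exactly 0.
    y = -math.log10(max(p, 1e-300))
    if p >= alpha:
        label = "not significant"
    elif k < 1:
        label = "relaxed"
    else:
        label = "intensified"
    return x, y, label


example = json.dumps(
    {"test results": {"LRT": 12.3, "p-value": 0.001,
                      "relaxation or intensification parameter": 0.5}}
)
print(relax_volcano_point(example))
```

Plotting K on a log scale keeps relaxation (K < 1) and intensification (K > 1) symmetric around zero, which is the usual reason for a volcano-style layout.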
3D_plot_D2.py: A multivariate outlier detection and visualization script. It uses the Minimum Covariance Determinant (MCD) to calculate robust Mahalanobis distances (D²) and employs the Local Outlier Factor (LOF) and robust Z-scores to identify exceptional gene family repertoires in a 3D morphospace.
Software dependencies: Python 3. Required Python libraries: pandas, numpy, scikit-learn, matplotlib, scipy.
Required inputs: A CSV file containing the quantitative traits to be analyzed (e.g., gene family counts or other genomic features). The file should contain one row per species or sample, numerical columns representing the variables used to construct the morphospace, and a column containing the species/sample identifiers.
Configuration: Before running the script, modify the input and plotting parameters if necessary, including the input CSV file path, the columns used for the 3D morphospace, the thresholds used for outlier detection, and the visualization parameters (labels, colors, figure size). These parameters are defined near the beginning of the script.
Run: python 3D_plot_D2.py
Output: 1) Robust Mahalanobis distance values (D²) for each sample; 2) LOF outlier scores; 3) robust Z-scores for each variable; 4) a 3D morphospace visualization highlighting outlier taxa; 5) summary tables identifying statistically exceptional gene family repertoires.
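Of the three outlier measures listed, the robust Z-score is the simplest to illustrate with the Python standard library alone. The sketch below uses the common median/MAD formulation with the 0.6745 scaling constant; whether 3D_plot_D2.py uses this exact constant or threshold is an assumption, and the function name robust_z_scores and the example counts are made up for illustration. The MCD and LOF steps are not reproduced here.

```python
from statistics import median


def robust_z_scores(values, scale=0.6745):
    """Median/MAD-based Z-scores for one variable (one value per species).

    z_i = scale * (x_i - median) / MAD, where MAD is the median absolute
    deviation from the median. The 0.6745 constant makes the score comparable
    to a standard Z-score under normality; it is an assumed default here.
    """
    values = list(values)
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; robust Z-scores are undefined")
    return [scale * (v - med) / mad for v in values]


# Hypothetical receptor counts for six species; the last one is unusually large.
counts = [60, 62, 58, 61, 59, 120]
scores = robust_z_scores(counts)
flagged = [c for c, z in zip(counts, scores) if abs(z) > 3.5]  # 3.5: assumed cutoff
print(flagged)
```

Using the median and MAD instead of the mean and standard deviation keeps a single extreme repertoire from inflating the scale estimate and masking itself.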
Datasets, Intermediate Files, Trees & Similar
IQTREE.zip: All datasets used for the IQ-TREE analyses, together with the partitioning files and resulting trees.
Virilis_section_tree.zip: Datasets used for the multilocus phylogeny of the virilis section.
Supplementary_data.xls: Excel file with genome information, gene counts, and analysis details.
ChemGenSeqs.zip: Chemosensory gene sequences (nt) predicted in this study.
Genome_assemblies.zip: Genome assemblies from D. enhydrobia (SAMN56506040), D. picta (SAMN56506041), and D. flexa (SAMN56506042).
MiniProt_predicted_proteins.zip: Predicted protein-coding genes across 13 Siphlodora and outgroup species.
Provider:
Zenodo
Created:
2026-05-05



