Phaeocystis globosa colonial gene expression

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/1476490
Resource description:
Data and analysis for the paper: "Differential gene expression supports a resource-intensive, defensive role for colony production in the bloom-forming haptophyte, Phaeocystis globosa" by Margaret Mars Brisbin and Satoshi Mitarai.

The Phaeocystis globosa CCMP1528 transcriptome used in the study (phaeocystisglobosa_euk_seqs.fasta or pg_euk_seqs_altnames.fasta) was assembled with Trinity (v2.3.2) from trimmed sequencing reads from 8 biological replicates (4 colonial and 4 solitary). Raw sequencing reads are available from the NCBI SRA under accession numbers SRR7811979–SRR7811986.

Before assembly, reads were quality filtered and trimmed with Trimmomatic (v0.36) using the command:

    java -jar $TRIM/trimmomatic-0.36.jar PE -phred33 $DATA2/S${SLURM_ARRAY_TASK_ID}_S*_R1_001.fastq.gz \
      $DATA2/S${SLURM_ARRAY_TASK_ID}_S*_R2_001.fastq.gz \
      $OUT/S${SLURM_ARRAY_TASK_ID}_1_paired.fq $OUT/S${SLURM_ARRAY_TASK_ID}_1_unpaired.fq \
      $OUT/S${SLURM_ARRAY_TASK_ID}_2_paired.fq $OUT/S${SLURM_ARRAY_TASK_ID}_2_unpaired.fq \
      ILLUMINACLIP:$TRIM/adapters/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Trimmed reads were mapped to the ERCC reference sequences for Mix1, and read pairs that mapped to ERCC were removed using the following commands from bowtie2 (v2.2.6), samtools, and bedtools:

    bowtie2 -t -x $REF \
      -1 $DATA/S${SLURM_ARRAY_TASK_ID}_1_paired.fq \
      -2 $DATA/S${SLURM_ARRAY_TASK_ID}_2_paired.fq \
      -S $OUT/S${SLURM_ARRAY_TASK_ID}_ercc.sam
    samtools view -bS $DATA/S${SLURM_ARRAY_TASK_ID}_ercc.sam > $DATA/S${SLURM_ARRAY_TASK_ID}.bam
    samtools sort $DATA/S${SLURM_ARRAY_TASK_ID}.bam $DATA/S${SLURM_ARRAY_TASK_ID}_sorted
    samtools view -b -f 13 S${SLURM_ARRAY_TASK_ID}_sorted.bam > S${SLURM_ARRAY_TASK_ID}_unmapped.bam
    samtools sort -n $DATA/S${SLURM_ARRAY_TASK_ID}_unmapped.bam $DATA/S${SLURM_ARRAY_TASK_ID}.qsort
    bedtools bamtofastq -i $DATA/S${SLURM_ARRAY_TASK_ID}.qsort.bam \
      -fq $DATA/S${SLURM_ARRAY_TASK_ID}_1_paired.fq \
      -fq2 $DATA/S${SLURM_ARRAY_TASK_ID}_2_paired.fq

The resulting trimmed reads without ERCC sequences were used to make the transcriptome assembly:

    Trinity --seqType fq --max_memory 475G \
      --left $DATA2/C1_1_paired.fq,$DATA2/C2_1_paired.fq,$DATA2/C3_1_paired.fq,$DATA2/C4_1_paired.fq,$DATA2/S1_1_paired.fq,$DATA2/S2_1_paired.fq,$DATA2/S3_1_paired.fq,$DATA2/S4_1_paired.fq \
      --right $DATA2/C1_2_paired.fq,$DATA2/C2_2_paired.fq,$DATA2/C3_2_paired.fq,$DATA2/C4_2_paired.fq,$DATA2/S1_2_paired.fq,$DATA2/S2_2_paired.fq,$DATA2/S3_2_paired.fq,$DATA2/S4_2_paired.fq \
      --CPU 12

The Trinity assembly was dereplicated with CD-HIT-EST (v2016-0304) at 95% sequence identity:

    cd-hit-est -i $DATA/Trinity.fasta -o Trinity_Pg_clustered_95 -c 0.95 -n 8 -p 1 -g 1 -M 200000 -T 8 -d 40

The Trinity assembly was filtered to remove bacterial contamination by first running blastn (v2.6.0+) against the NCBI nr/nt database:

    blastn -query $DATA/Trinity_Pg_clustered_95.fasta -task blastn -db $REF -num_threads 12 -max_target_seqs 1 -outfmt 5 > TrinityBlast.xml

and then removing bacterial sequences with the custom Python scripts included here: TrinityBlastXML.ipynb and FIlterTrinityEukNotEuk.ipynb.

RSEM (v1.2.22) was run with the final transcriptome assembly (phaeocystisglobosa_euk_seqs.fasta or pg_euk_seqs_altnames.fasta), once for the colonial (C) replicates and once for the solitary (S) replicates:

    rsem-calculate-expression --bowtie2 --paired-end \
      $DATA/C${SLURM_ARRAY_TASK_ID}_1_paired.fq \
      $DATA/C${SLURM_ARRAY_TASK_ID}_2_paired.fq \
      $REF/rsemref_longISO/pg_euks_RSEMref \
      $REF/rsemout_longISO/C${SLURM_ARRAY_TASK_ID}
    rsem-calculate-expression --bowtie2 --paired-end \
      $DATA/S${SLURM_ARRAY_TASK_ID}_1_paired.fq \
      $DATA/S${SLURM_ARRAY_TASK_ID}_2_paired.fq \
      $REF/rsemref_longISO/pg_euks_RSEMref \
      $REF/rsemout_longISO/S${SLURM_ARRAY_TASK_ID}

The resulting data files are C*.genes.results and S*.genes.results, which were used with DESeq2 in the R environment to analyze differential gene expression. The code for these analyses is available in HTML and R Markdown (PhaeoColSol_DE.html, PhaeoColSol_DE.Rmd).
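As a rough illustration of how per-sample RSEM tables such as C*.genes.results and S*.genes.results could be combined into one gene-by-sample count matrix before differential expression analysis, the Python sketch below reads the expected_count column from each table. The authors' actual import code is in PhaeoColSol_DE.Rmd and may differ. The rounding of expected counts to integers, the function names, and the demo gene IDs and values are assumptions made here for illustration; only the column names follow the standard RSEM genes.results header.

```python
import csv
import io

def read_expected_counts(handle):
    """Return {gene_id: expected_count} from one RSEM *.genes.results table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene_id"]: float(row["expected_count"]) for row in reader}

def build_count_matrix(samples):
    """Combine per-sample tables into (sample_names, {gene_id: [integer counts]}).

    `samples` maps a sample name to an open text handle. Counts are rounded
    because count-based tools expect integers (an assumption of this sketch).
    """
    names = sorted(samples)
    per_sample = {name: read_expected_counts(samples[name]) for name in names}
    genes = sorted(set().union(*(counts.keys() for counts in per_sample.values())))
    matrix = {
        gene: [round(per_sample[name].get(gene, 0.0)) for name in names]
        for gene in genes
    }
    return names, matrix

# Hypothetical demo tables standing in for C1.genes.results and S1.genes.results.
HEADER = "\t".join(
    ["gene_id", "transcript_id(s)", "length", "effective_length",
     "expected_count", "TPM", "FPKM"]
)
DEMO_C1 = "\n".join([
    HEADER,
    "\t".join(["TRINITY_DN1_c0_g1", "TRINITY_DN1_c0_g1_i1", "900", "750.2", "10.00", "5.1", "4.8"]),
    "\t".join(["TRINITY_DN2_c0_g1", "TRINITY_DN2_c0_g1_i1", "400", "250.9", "4.40", "2.0", "1.9"]),
]) + "\n"
DEMO_S1 = "\n".join([
    HEADER,
    "\t".join(["TRINITY_DN1_c0_g1", "TRINITY_DN1_c0_g1_i1", "900", "751.0", "3.20", "1.7", "1.6"]),
    "\t".join(["TRINITY_DN2_c0_g1", "TRINITY_DN2_c0_g1_i1", "400", "251.3", "27.80", "9.3", "9.0"]),
]) + "\n"

names, matrix = build_count_matrix(
    {"C1": io.StringIO(DEMO_C1), "S1": io.StringIO(DEMO_S1)}
)
for gene, counts in matrix.items():
    print(gene, dict(zip(names, counts)))
```

With real data, the in-memory demo handles would be replaced by files opened from the RSEM output directory, and the matrix would be written out for import into R.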
The transcriptome assembly was annotated with the Dammit software (v1.0rc2), which wraps Transdecoder, HMMER, and BUSCO, and by submitting the translated amino acid sequences to GhostKOALA.  The raw pfam Dammit annotation results are included: pg_euk_seqs.fasta.x.pfam.gff3. These results were parsed with the script: Pfam_gffParsing.ipynb. The resulting file, pfam_parsed_annotation.csv, is used in the script PhaeoColSol_DE.Rmd with pfam2go4R.txt for GO enrichment analysis. The script shinycolsol.Rmd creates an interactive plot of GO enrichment results.  The GhostKOALA results are user_ko.csv, and are used in the script PhaeoColSol_DE.Rmd for KEGG pathway enrichment analysis.
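Pfam_gffParsing.ipynb is not reproduced here. As a minimal sketch of this kind of parsing step, the Python below pulls a Pfam accession out of each feature line of a GFF3 file and writes a small CSV. It assumes standard nine-column, tab-separated GFF3 and that a Pfam accession ("PF" followed by five digits) appears somewhere in the attributes column. The demo lines, attribute keys, output column names, and function names are hypothetical, and the real pfam_parsed_annotation.csv may be laid out differently.

```python
import csv
import io
import re

# Pfam accessions have the form PF + five digits; a version suffix (.NN) is ignored.
PFAM_ACCESSION = re.compile(r"PF\d{5}")

def parse_pfam_gff3(handle):
    """Yield (seqid, start, end, pfam_accession) for each feature with a Pfam accession."""
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF3 feature line
        match = PFAM_ACCESSION.search(fields[8])
        if match:
            yield fields[0], int(fields[3]), int(fields[4]), match.group(0)

def write_csv(records, handle):
    """Write parsed records as CSV with a header row."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["seqid", "start", "end", "pfam"])
    writer.writerows(records)

# Hypothetical demo input; real Dammit output may use different attribute keys.
DEMO_GFF3 = (
    "##gff-version 3\n"
    "contig1\tHMMER\tprotein_hmm_match\t10\t250\t1.2e-30\t+\t.\tID=homology:1;Name=PF00069.25;Note=Protein kinase domain\n"
    "contig2\tHMMER\tprotein_hmm_match\t5\t90\t3.0e-08\t+\t.\tID=homology:2;Name=PF00012.20\n"
    "contig3\tHMMER\tprotein_hmm_match\t1\t40\t0.5\t+\t.\tID=homology:3\n"
)

records = list(parse_pfam_gff3(io.StringIO(DEMO_GFF3)))
out = io.StringIO()
write_csv(records, out)
print(out.getvalue(), end="")
```

Matching the accession with a regular expression rather than a fixed attribute key keeps the sketch independent of how a particular tool names its GFF3 attributes, at the cost of ignoring everything else in that column.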
Created:
2024-08-02