3. Ecological genomics of the Northern krill: Genome assembly annotations (genes and repeats)
Source: figshare.scilifelab.se. Updated 2024-03-28; indexed 2025-03-26.
Download link:
https://figshare.scilifelab.se/articles/dataset/3_Ecological_genomics_of_the_Northern_krill_Genome_assembly_annotations_genes_and_repeats_/22786925/1
Description:
This item holds multiple gene and repeat model and annotation files, including coordinates in GFF/GTF format, TXT/TSV tables and sequences in FASTA format. It also contains some accessory RNA-seq gene resources, such as Trinity-assembled transcripts and Nanopore cDNA sequences, that were used at various stages of assembly and annotation.
Coordinates refer to the main genome assembly reference sequence (1.m_norvegica.main_w_mito.fasta) but focus on the nuclear genome assembly and rarely include features of the mitochondrial assembly. Mitochondrial annotations are provided separately (see below).
Contents:
trinity_transcripts.tar.gz, an archive with n=573,869 RNA transcripts in FASTA format that were assembled with Trinity from Illumina RNA-seq data.
trinity_transcript.16509_single_isoforms.cds.fasta.tar.gz, a subset of 16,509 single (longest) isoforms of putatively protein-coding transcripts, used to assess genome assembly metrics such as duplication and base-level error. Sequences are in FASTA format.
nanopore_cDNA.representative_sequences_vsearch.tar.gz, n=25,484 cDNA Nanopore sequence reads used to filter gene models and scaffold the genome.
annotations.all_genes_and_isoforms.redundant.tar.gz, an archive with all (n=202,138) gene models and isoforms/alternative splice variants, also including non-protein-coding genes.
annotations.protein_coding_gene_models.non_redundant.gff3, a non-redundant (i.e. single-isoform) set of putative protein-coding gene bodies (n=42,227) in standard GFF3 format.
annotations.protein_coding_gene_models.non_redundant.CDS.fasta, the matching set of putative protein-coding genes in FASTA format (CDS nucleotide sequences).
annotations.protein_coding_gene_models.non_redundant.PEP.fasta, the matching set of putative protein sequences in FASTA format (PEP peptide sequences).
annotations.protein_coding_gene_models.non_redundant.PEP.fasta.BLAST.DROSOPHILA.tsv.tar.gz, output from BLASTP analyses between Northern krill and Drosophila peptide sequences (BLAST outfmt 6).
annotations.protein_coding_gene_models.non_redundant.PEP.fasta.EnTAP.final_annotations_lvl1.tsv, main output from EnTAP functional annotations of protein coding genes.
annotations.protein_coding_gene_models.non_redundant_added_stop_codons.gff, non-redundant protein-coding models as above, but missing stop-codons have been added if detected in the reference genome assembly (GFF format).
annotations.protein_coding_gene_models.non_redundant_added_stop_codons.CDS.fasta, the matching CDS nucleotide sequences, in which missing stop-codons have been added if detected in the reference genome assembly (FASTA format).
mitochondrion.tar.gz, an archive with gene coordinates and sequences of tRNAs, rRNAs, protein-coding genes and repeat features on the mitochondrial chromosome, as inferred using MITOS2. Files are standard BED/GFF/TSV/TXT/FASTA files and more information about formats can be found on the site for the original tool: http://mitos2.bioinf.uni-leipzig.de/help.py
annotations.repeat_library.fasta, a custom set of n=10,909 non-redundant repeat sequences in FASTA format that were used to annotate the genome for repeats using RepeatMasker.
annotations.repeats_across_the_genome_repeatmasker.tbl, the standard RepeatMasker masking overview output table.
annotations.repeats_across_the_genome_repeatmasker.out.tar.gz, the full set of masked repeats and their coordinates across the genome.
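The BLASTP table listed above is described as BLAST outfmt 6, which by default is a tab-separated table with twelve columns. The following is a minimal sketch of how such a table could be reduced to one best hit per query. It assumes the file uses the default column order (custom column sets are possible and would need a different header list); the example rows and identifiers are hypothetical and not taken from the dataset.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
# Assumption: the krill/Drosophila table uses these defaults.
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return the highest-bitscore hit per query from an outfmt 6 stream."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(BLAST6_COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Hypothetical rows for illustration only.
example = io.StringIO(
    "geneA\tFBpp001\t45.0\t200\t100\t5\t1\t200\t1\t198\t1e-50\t180\n"
    "geneA\tFBpp002\t60.0\t210\t80\t2\t1\t210\t3\t212\t1e-80\t250\n"
)
hits = best_hits(example)
print(hits["geneA"]["sseqid"])
```

For the real file, the archive would first need to be unpacked and the TSV opened with `open(path, newline="")` in place of the in-memory example.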
trinity_transcripts.tar.gz
This archive contains the transcripts assembled from RNA-seq data produced from six RNA extractions/tissues of the reference specimen. There are three FASTA files:
trinity_transcripts.all_genes_and_isoforms.fasta = all assembled transcripts (n=573,869)
trinity_transcripts.metazoan_genes_and_isoforms.CDS.fasta = a subset of n=60,677 assembled and putatively coding transcripts with best hits against Metazoan sequences (CDS nucleotide sequences)
trinity_transcripts.metazoan_genes_and_isoforms.PEP.fasta = the n=60,677 corresponding peptide sequences.
nanopore_cDNA.representative_sequences_vsearch.tar.gz
This archive contains putatively full-length cDNA reads in three FASTA files:
clusters.fa = VSEARCH cluster representatives (i.e. cluster centroids with low error rates) that retain the original Nanopore sequence headers (n=25,484)
clusters.renamed.fa = as above, but renamed with simple incrementing headers.
clusters.renamed.min_500bp.fa = as above, but only reads longer than 500 bp (n=24,632). These reads were used to scaffold the genome.
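The length filter behind clusters.renamed.min_500bp.fa can be illustrated with a short sketch. This is not the authors' pipeline; it is a plain-Python reading of "only reads longer than 500 bp", and whether the original threshold was inclusive or exclusive is an assumption. The read names and sequences are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_min_length(records, min_length=500):
    """Keep records strictly longer than min_length (assumed exclusive)."""
    return [(h, s) for h, s in records if len(s) > min_length]


# Hypothetical reads: one of 600 bp and one of 200 bp.
example = io.StringIO(
    ">read_1\n" + "A" * 600 + "\n>read_2\n" + "ACGT" * 50 + "\n"
)
kept = filter_min_length(read_fasta(example))
print([h for h, _ in kept])
```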
annotations.all_genes_and_isoforms.redundant.tar.gz
This archive contains gene models in four files:
annotations.all_genes_and_isoforms.redundant.gtf, coordinates in GTF format
annotations.all_genes_and_isoforms.redundant.gff3, coordinates in GFF3 format
annotations.all_genes_and_isoforms.redundant.fasta, sequences in FASTA format
annotations.all_genes_and_isoforms.redundant.transcripts.tsv, a TSV table with three fields specifying: 1) the final name of the isoform/splice variant; 2) the name of the gene model it belongs to; 3) the original name of the isoform.
These models were consolidated into loci using GFFCOMPARE from multiple sources of data, including RNA and comparative data. The names of the original isoforms indicate source:
STRG = HISAT/STRINGTIE RNA-seq gene model. Tagged "REF_STRG" in the final gene model.
mRNA = Assembled Trinity transcript. Tagged "REF_TRIN" in the final gene model.
COMPARATIVE_SPALN = Comparative model derived from other crustaceans.
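The source tags above can be used to tally isoforms by evidence type from the three-column transcripts TSV. The sketch below assumes that the third field (original isoform name) contains one of the tokens STRG, mRNA or COMPARATIVE_SPALN somewhere in the name; the exact naming pattern is not given here, so the rows are hypothetical and the labels are illustrative.

```python
import csv
import io
from collections import Counter

# Checked in order, most specific token first.
# Assumption: the token appears somewhere in the original isoform name.
SOURCE_TOKENS = [
    ("COMPARATIVE_SPALN", "comparative (SPALN)"),
    ("STRG", "HISAT/StringTie RNA-seq"),
    ("mRNA", "Trinity transcript"),
]


def classify_source(original_name):
    """Map an original isoform name to its data source label."""
    for token, label in SOURCE_TOKENS:
        if token in original_name:
            return label
    return "unknown"


def count_sources(handle):
    """Count isoforms per source using the third TSV field."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip malformed or empty lines
        counts[classify_source(row[2])] += 1
    return counts


# Hypothetical rows: final name, gene model name, original name.
example = io.StringIO(
    "iso1\tgene1\tSTRG.1.1\n"
    "iso2\tgene1\tmRNA_42\n"
    "iso3\tgene2\tCOMPARATIVE_SPALN_7\n"
)
counts = count_sources(example)
print(dict(counts))
```

If the real table carries a header row, it would be counted as "unknown" and should be skipped before tallying.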
GFF and GTF format specifications are available here:
https://www.ensembl.org/info/website/upload/gff.html
https://www.ensembl.org/info/website/upload/gff3.html
annotations.protein_coding_gene_models.non_redundant.(gff3|CDS.fasta|PEP.fasta)
These files contain a filtered set comprising the "best" model isoform of each locus (n=42,227 in total), determined by comparison to NCBI RefSeq. These models were used to annotate SNPs and to infer homology/orthology, gene family evolution and molecular evolution.
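A quick sanity check on the non-redundant GFF3 file is to count features by type and compare the locus count with the reported n=42,227. The sketch below relies only on the standard GFF3 layout (nine tab-separated columns, with the feature type in column 3). Which feature type represents a locus in this particular file ("gene", "mRNA" or something else) is an assumption to verify against the file itself; the example lines are hypothetical.

```python
import io
from collections import Counter


def count_feature_types(handle):
    """Count GFF3 features by the type given in column 3."""
    counts = Counter()
    for line in handle:
        if line.startswith("##FASTA"):
            break  # an optional embedded sequence section follows
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a standard feature line
        counts[fields[2]] += 1
    return counts


# Hypothetical feature lines for illustration only.
example = io.StringIO(
    "##gff-version 3\n"
    "seq_1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1\n"
    "seq_1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=t1;Parent=g1\n"
    "seq_1\tsrc\tCDS\t150\t600\t.\t+\t0\tID=c1;Parent=t1\n"
    "seq_1\tsrc\tCDS\t700\t850\t.\t+\t0\tID=c2;Parent=t1\n"
)
types = count_feature_types(example)
print(dict(types))
```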
annotations.repeat_library.fasta
This FASTA file contains the representative and non-redundant template repeat sequences that were used to annotate the Northern krill genome for interspersed repeats. The sequence headers indicate several aspects of each repeat.
Example: "seq_c_98391_5186_12351_FIN_ReC99C#LTR/Pao"
This indicates that the template:
is located on sequence seq_c_98391 with start/stop coordinates 5186/12351
was originally detected using LTR_Finder ("FIN")
was classified as "LTR/Pao" using RepeatClassifier ("ReC")
has 99% identity between the 5' and 3' LTRs ("99") and was considered complete with respect to the expected protein domains detected along the repeat.
Additional tags and nomenclature are described in the paper methods.
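The header layout described above can be unpacked with a regular expression. The sketch below is written against the single example header only. It assumes that the trailing "C" is the flag for a complete element and that the fields are separated exactly as in the example. Headers that use other tags described in the paper will not match and are returned as None rather than guessed at.

```python
import re

# Pattern derived from the one documented example header.
# Assumption: a trailing capital letter (here "C") flags completeness.
HEADER_RE = re.compile(
    r"(?P<sequence>.+)_(?P<start>\d+)_(?P<end>\d+)"
    r"_(?P<detector>[A-Za-z]+)"
    r"_(?P<classifier>[A-Za-z]+?)(?P<ltr_identity>\d+)(?P<flag>[A-Z]?)"
    r"#(?P<repeat_class>.+)"
)


def parse_repeat_header(header):
    """Split a repeat library header into its parts, or return None."""
    match = HEADER_RE.fullmatch(header.lstrip(">"))
    if match is None:
        return None
    info = match.groupdict()
    info["start"] = int(info["start"])
    info["end"] = int(info["end"])
    info["ltr_identity"] = int(info["ltr_identity"])
    return info


parsed = parse_repeat_header("seq_c_98391_5186_12351_FIN_ReC99C#LTR/Pao")
print(parsed)
```

Returning None for unmatched headers keeps the parser from silently misreading the additional tags documented in the paper methods.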
Provider:
figshare.scilifelab.se



