Updated short variant, gene and transposable element predictions for P. nodorum isolates.
Source: Mendeley Data. Updated: 2024-06-27. Indexed: 2024-06-27.
Download link:
https://figshare.com/articles/dataset/Updated_gene_and_transposable_element_predictions_for_P_nodorum_isolates_SN15_SN4_SN79_and_SN2000/13340975
Description:
Gene, repeat, and transposable element predictions for the Parastagonospora nodorum isolates analysed. Short variant predictions for the P. nodorum pangenome relative to SN15 are also included in VCF format (combined.vcf.gz).

Many of these isolates, including SN15, are also deposited at the NCBI under BioProjects https://www.ncbi.nlm.nih.gov/bioproject/PRJNA612761 and https://www.ncbi.nlm.nih.gov/bioproject/PRJNA686477. Those assemblies are slightly different due to the submission requirements of the NCBI. The assemblies here are the same as the ones analysed in the manuscript.

The file changes.tsv contains details of scaffold splits and removed genes from Illumina-sequenced isolates. The file gene_changes.tsv includes further details of removed genes in the Illumina-sequenced isolates. The only major changes to SN15 (other than removal of low-confidence genes) are that 100 bp of Ns were removed from the 5' end of Chromosome 7 (meaning that all features on that chromosome are shifted over by 100 bp), and one gene was pseudogenised due to an internal stop codon.

The isolates sequenced as part of Richards 2018 (https://doi.org/10.1534/g3.117.300462; SN4, SN2000, SN79) and Syme 2018 (https://doi.org/10.1093/gbe/evy192; RSID*) are deposited here only, as we don't control the NCBI entries and there are currently no publicly available annotations for these genomes.
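The Chromosome 7 shift matters when reusing coordinates from the earlier SN15 annotation. A minimal sketch of the conversion, assuming 1-based inclusive coordinates and a hypothetical sequence ID `chr07` (check the FASTA headers for the real name), could look like this:

```python
CHR7_TRIM = 100  # bp of Ns removed from the 5' end of Chromosome 7, per the description


def shift_chr7(seqid, start, end, chr7_id="chr07", trim=CHR7_TRIM):
    """Map an old SN15 interval (1-based, inclusive) onto the updated assembly.

    `chr7_id` is a placeholder; the actual sequence name is not given
    in the dataset description.
    """
    if seqid != chr7_id:
        # Other sequences are assumed to be unchanged.
        return start, end
    if start <= trim:
        raise ValueError("interval begins inside the removed N block")
    return start - trim, end - trim


print(shift_chr7("chr07", 1500, 2400))  # (1400, 2300)
print(shift_chr7("chr01", 1500, 2400))  # (1500, 2400)
```

Features on other sequences pass through untouched, and an interval that starts inside the removed block raises an error rather than being silently clipped.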
Each zipped folder will unzip to contain something like the following structure:

    20200513-14FG141-genome.fasta  # The actual assembly.
    20200513-14FG141-genome.fasta.md5
    20200513-14FG141-genome_contigs.gff3  # Mapping contigs to scaffolds.
    20200513-14FG141-genome_contigs.gff3.md5
    20200513-14FG141-genome_softmasked.fasta  # A soft-masked version of the assembly for further gene prediction.
    20200513-14FG141-genome_softmasked.fasta.md5
    20200513-14FG141-mitochondrial.fasta  # The mitochondrial assembly for this isolate.
    20200513-14FG141-mitochondrial.fasta.md5
    20200513-14FG141-mitochondrial_contigs.gff3
    20200513-14FG141-mitochondrial_contigs.gff3.md5
    20200513-14FG141-repeats.gff3  # Repeat and transposable element annotations.
    20200513-14FG141-repeats.gff3.md5
    20200519-14FG141-genes.gff3  # Protein coding, rRNA, and tRNA genes for this assembly.
    20200519-14FG141-genes.gff3.md5
    20200519-14FG141-proteins.fasta  # Extracted protein sequences.
    20200519-14FG141-proteins.fasta.md5
    20200519-14FG141-transcripts.fasta  # Extracted exon nucleotide sequences (note that these are not CDSs).
    20200519-14FG141-transcripts.fasta.md5
    CHANGELOG.txt  # Details of changes and how files were created.

Transposable elements and repeats were predicted using the [PanTE](https://github.com/darcyabjones/pante) pipeline. Protein coding genes were predicted from softmasked genomes using the [panann pipeline](https://github.com/darcyabjones/panann). Protein coding genes overlapping rRNA genes by more than 50% of their length were excluded. Protein coding genes with exons overlapping genome gaps (stretches of N >= 100 bp) were split into fragments, annotated in the GFF with the attribute `fragmented=true`.

Note that some protein coding genes looked a bit dubious (lots of short exons). We attempted to mark these, based on the support they have, in the GFF with the attribute `low_confidence_prediction=true`. This attribute can be manually removed if the gene looks fine to you.
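To work with the `fragmented=true` and `low_confidence_prediction=true` flags, the GFF3 attribute column can be parsed with the standard library alone. The sketch below assumes the flags sit on features of type `gene`; that detail is not stated in the description, so the feature type is a parameter, and the example records are made up for illustration.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def classify_genes(gff_lines, feature_type="gene"):
    """Return (kept, low_confidence, fragmented) lists of feature IDs."""
    kept, low_conf, fragmented = [], [], []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        attrs = parse_attributes(cols[8])
        gene_id = attrs.get("ID", "")
        if attrs.get("fragmented") == "true":
            fragmented.append(gene_id)
        if attrs.get("low_confidence_prediction") == "true":
            low_conf.append(gene_id)
        else:
            kept.append(gene_id)
    return kept, low_conf, fragmented


# Made-up records; real IDs, sources, and coordinates will differ.
example = [
    "##gff-version 3",
    "scaffold_1\tpanann\tgene\t100\t900\t.\t+\t.\tID=gene1;confidence_set=C",
    "scaffold_1\tpanann\tgene\t1500\t2100\t.\t-\t.\tID=gene2;low_confidence_prediction=true",
    "scaffold_1\tpanann\tgene\t3000\t3400\t.\t+\t.\tID=gene3;fragmented=true",
]
kept, low_conf, fragmented = classify_genes(example)
print(kept, low_conf, fragmented)
```

For a real file, pass an open handle instead of the list, for example `classify_genes(open("20200519-14FG141-genes.gff3"))`. Fragmented genes are reported separately but still kept, since the description treats fragmentation as an annotation rather than a reason for exclusion.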
The rRNA genes were predicted using RNAmmer v1.2, with some predictions coming from RepeatMasker. The tRNA genes were predicted using tRNAscan-SE v2.0.3.

The SN15 annotations are an updated version of the ones published in Bertazzoni et al. (https://doi.org/10.1186/s12864-021-07699-8). Newly predicted genes were added as a "C" set (in addition to the existing A and B sets) if they didn't overlap an existing annotation on the same strand by more than 20% of the length of the existing annotation. All protein coding gene annotations have an attribute `confidence_set=`, which indicates whether the gene belongs to the A or B (previous) or the C (updated) gene predictions.
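The 20% rule for admitting genes to the "C" set can be illustrated with a small interval check. This is only a sketch of the stated criterion, not the pipeline's actual code: it assumes overlap is measured on whole gene spans (the description does not say whether exons or spans were compared), and the tuple layout and function names are invented for the example.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Overlap in bp between two 1-based inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def is_new_c_gene(new, existing, max_fraction=0.2):
    """Return True if `new` passes the overlap rule against all `existing` genes.

    Each gene is a (seqid, start, end, strand) tuple. A candidate fails when
    it covers more than `max_fraction` of any existing same-strand annotation.
    """
    seqid, start, end, strand = new
    for e_seqid, e_start, e_end, e_strand in existing:
        if e_seqid != seqid or e_strand != strand:
            continue  # only same-sequence, same-strand annotations count
        e_len = e_end - e_start + 1
        if overlap_bp(start, end, e_start, e_end) > max_fraction * e_len:
            return False
    return True


existing = [("chr01", 1000, 1999, "+")]  # a 1000 bp existing annotation
print(is_new_c_gene(("chr01", 1900, 2500, "+"), existing))  # True: 100 bp overlap (10%)
print(is_new_c_gene(("chr01", 1700, 2500, "+"), existing))  # False: 300 bp overlap (30%)
print(is_new_c_gene(("chr01", 1700, 2500, "-"), existing))  # True: opposite strand
```

Note that the threshold is relative to the length of the existing annotation, not the new prediction, which follows the wording of the rule above.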
Created:
2023-06-28



