Updated short variant, gene and transposable element predictions for P. nodorum isolates.

Name: Updated short variant, gene and transposable element predictions for P. nodorum isolates.
Creator: figshare
Published: 2021-11-02 06:12:20
License: No description available

DataCite Commons. Updated 2021-11-02; indexed 2024-08-17.

Download link:

https://figshare.com/articles/dataset/Updated_gene_and_transposable_element_predictions_for_P_nodorum_isolates_SN15_SN4_SN79_and_SN2000/13340975

Resource description:

Gene, repeat, and transposable element predictions for the Parastagonospora nodorum isolates analysed. Short variant predictions for the P. nodorum pangenome relative to SN15 are also included in VCF format (combined.vcf.gz).

Many of these isolates, including SN15, are also deposited at the NCBI under the BioProjects https://www.ncbi.nlm.nih.gov/bioproject/PRJNA612761 and https://www.ncbi.nlm.nih.gov/bioproject/PRJNA686477. The NCBI assemblies differ slightly from the ones here because of the NCBI's submission requirements. The assemblies here are the same as the ones analysed in the manuscript.

The file changes.tsv contains details of scaffold splits and removed genes from Illumina-sequenced isolates. The file gene_changes.tsv includes further details of removed genes in the Illumina-sequenced isolates. The only major changes to SN15 (other than removal of low-confidence genes) are that 100 bp of Ns were removed from the 5' end of Chromosome 7 (meaning that all features are shifted over by 100 bp), and one gene was pseudogenised due to an internal stop codon.

The isolates sequenced as part of Richards 2018 (https://doi.org/10.1534/g3.117.300462; SN4, SN2000, SN79) and Syme 2018 (https://doi.org/10.1093/gbe/evy192; RSID*) are deposited here only, as we don't control the NCBI entries and there are currently no publicly available annotations for these genomes.
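The Chromosome 7 offset described above can be sketched as a small coordinate conversion. This is only an illustration, assuming that removing 100 bp from the 5' end means an old 1-based position p now sits at p - 100; the function names are hypothetical and are not part of the dataset or its pipelines.

```python
# Assumption: 100 bp of Ns were trimmed from the start of SN15 Chromosome 7,
# so an old 1-based position p corresponds to p - 100 in the updated assembly.
N_REMOVED = 100


def shift_chr7_position(old_pos, n_removed=N_REMOVED):
    """Map an old 1-based Chromosome 7 position to the updated assembly.

    Returns None when the position fell inside the removed stretch.
    """
    if old_pos < 1:
        raise ValueError("positions are 1-based")
    new_pos = old_pos - n_removed
    return new_pos if new_pos >= 1 else None


def shift_chr7_interval(start, end, n_removed=N_REMOVED):
    """Shift a closed 1-based interval, clipping any part inside the removed Ns.

    Returns None when the whole interval was inside the removed stretch.
    """
    if end < start:
        raise ValueError("end must be >= start")
    new_end = shift_chr7_position(end, n_removed)
    if new_end is None:
        return None
    # A start inside the removed stretch is clipped to the new first base.
    new_start = shift_chr7_position(start, n_removed) or 1
    return (new_start, new_end)
```

Features on every other sequence are unaffected, so a conversion like this would only be applied to records whose sequence ID is Chromosome 7 in the older SN15 annotation.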
Each zipped folder will unzip to contain something like the following structure:

    20200513-14FG141-genome.fasta  # The actual assembly.
    20200513-14FG141-genome.fasta.md5
    20200513-14FG141-genome_contigs.gff3  # Mapping of contigs to scaffolds.
    20200513-14FG141-genome_contigs.gff3.md5
    20200513-14FG141-genome_softmasked.fasta  # A soft-masked version of the assembly for further gene prediction.
    20200513-14FG141-genome_softmasked.fasta.md5
    20200513-14FG141-mitochondrial.fasta  # The mitochondrial assembly for this isolate.
    20200513-14FG141-mitochondrial.fasta.md5
    20200513-14FG141-mitochondrial_contigs.gff3
    20200513-14FG141-mitochondrial_contigs.gff3.md5
    20200513-14FG141-repeats.gff3  # Repeat and transposable element annotations.
    20200513-14FG141-repeats.gff3.md5
    20200519-14FG141-genes.gff3  # Protein coding, rRNA, and tRNA genes for this assembly.
    20200519-14FG141-genes.gff3.md5
    20200519-14FG141-proteins.fasta  # Extracted protein sequences.
    20200519-14FG141-proteins.fasta.md5
    20200519-14FG141-transcripts.fasta  # Extracted exon nucleotide sequences (note that these are not CDSs).
    20200519-14FG141-transcripts.fasta.md5
    CHANGELOG.txt  # Details of changes and how files were created.

Transposable elements and repeats were predicted using the [PanTE](https://github.com/darcyabjones/pante) pipeline. Protein coding genes were predicted from softmasked genomes using the [panann pipeline](https://github.com/darcyabjones/panann). Protein coding genes overlapping rRNA genes by more than 50% of their length were excluded. Protein coding genes with exons overlapping genome gaps (stretches of N >= 100 bp) were split into fragments, annotated in the GFF with the attribute `fragmented=true`.

Note that some protein coding genes looked a bit dubious (lots of short exons). We attempted to mark these in the GFF, based on the support they have, with the attribute `low_confidence_prediction=true`. This attribute can be manually removed if the gene looks fine to you.
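For readers who would rather set the flagged genes aside than review them one by one, the following is a minimal sketch of filtering on the `low_confidence_prediction=true` attribute. It assumes standard nine-column, tab-separated GFF3 with `key=value` pairs separated by semicolons in the last column; the helper names and the toy records are hypothetical. Whether child features (mRNA, exon, CDS) carry the same attribute as their parent gene is not stated in the description, so check that in the real files before relying on a line-by-line filter.

```python
def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value strings."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def drop_low_confidence(lines):
    """Yield GFF3 lines, skipping features tagged low_confidence_prediction=true.

    Comment, directive, and blank lines are passed through unchanged.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            attrs = parse_gff3_attributes(cols[8])
            if attrs.get("low_confidence_prediction") == "true":
                continue
        yield line


# Toy records for illustration only; these are not taken from the dataset.
sample = [
    "##gff-version 3\n",
    "scaffold_1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1\n",
    "scaffold_1\texample\tgene\t1500\t1700\t.\t-\t.\tID=gene2;low_confidence_prediction=true\n",
]
kept = list(drop_low_confidence(sample))
```

With a real file, the same generator can be fed an open file handle, for example `drop_low_confidence(open("20200519-14FG141-genes.gff3"))`, and the surviving lines written to a new GFF3.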
The rRNA genes were predicted using RNAmmer v1.2, with some predictions coming from RepeatMasker. The tRNA genes were predicted using tRNAscan-SE v2.0.3.

The SN15 annotations are an updated version of the ones published in Bertazzoni et al. (https://doi.org/10.1186/s12864-021-07699-8). Newly predicted genes were added as a "C" set (in addition to the existing A and B sets) if they didn't overlap an existing annotation on the same strand by more than 20% of the length of the previous annotation. All protein coding gene annotations have an attribute `confidence_set=` which indicates the A, B (previous), or C (updated) gene predictions.
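A quick way to see how the SN15 genes split across the A, B, and C sets is to tally the `confidence_set` attribute. The sketch below assumes nine-column, tab-separated GFF3 and assumes that the attribute values are the literal letters A, B, and C; the description does not say which feature type carries the attribute, so `feature_type` is left as a parameter. The function name and the toy records are hypothetical.

```python
from collections import Counter


def count_confidence_sets(lines, feature_type="gene"):
    """Count features of one type per confidence_set value in GFF3 lines.

    Features without the attribute are counted under the key "unset".
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        # Attribute column: semicolon-separated key=value pairs.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        counts[attrs.get("confidence_set", "unset")] += 1
    return counts


# Toy records for illustration only; these are not taken from the dataset.
sample = [
    "##gff-version 3\n",
    "chr_01\texample\tgene\t10\t500\t.\t+\t.\tID=g1;confidence_set=A\n",
    "chr_01\texample\tgene\t800\t1200\t.\t-\t.\tID=g2;confidence_set=C\n",
    "chr_01\texample\tgene\t2000\t2600\t.\t+\t.\tID=g3;confidence_set=C\n",
    "chr_01\texample\ttRNA\t3000\t3072\t.\t+\t.\tID=t1\n",
]
counts = count_confidence_sets(sample)
```

If the counts come back entirely as "unset" on a real file, the attribute is probably attached to a different feature type (for example mRNA), and `feature_type` should be changed accordingly.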

Provider:

figshare

Created:

2020-12-07
