
Salvia divinorum genome annotation files

Source: Figshare. Updated: 2026-03-05. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/_i_Salvia_divinorum_i_genome_annotation_files/31524469
Description:
Genome assembly and genome comparisons

The third-generation sequencing data of Salvia divinorum were assembled using Hifiasm v0.19.9 [41], resulting in a preliminary assembly of 574,586,498 bp (410 contigs). The assembly was then deduplicated using Purge_Dups [42], yielding a final assembly of 558,956,549 bp (215 contigs). Hi-C clean data were aligned to the contig sequences, and the contigs were scaffolded using juicer-3Ddna (20190319) [43]. A total of 20 super scaffolds (541,872,582 bp) were anchored to 11 chromosomes.

The previously published Salvia divinorum genome sequence [43] was downloaded from NCBI (PRJNA1104206); that version is discontinuous. The NCBI genome version (NCBI.fasta) and our assembled genome version (Salvia_divinorum.chr.fa) were compared using nucmer, delta-filter, show-coords, and mummerplot from the MUMmer toolkit [74]. Structural rearrangements were identified from the alignment results using syri [74,75] with the command line: syri -c ref_qry.delta.filter.coords -d ref_qry.delta.filter -r NCBI.fasta -q Salvia_divinorum.chr.fa. Structural variation plots were constructed using plotsr [76] with the command line: plotsr --sr syri.out --genomes genomes.txt -H 8 -W 5 -b pdf.

Annotation of repetitive DNA sequences

Repeat annotation combined two approaches: prediction against known repeats and de novo prediction. Known prediction used an existing repeat sequence database, RepBase (v23.06, https://www.girinst.org/repbase/) [77], with RepeatMasker (v4.1.6) [78] to identify repeat regions in the genome. De novo prediction built a genome-specific repeat library using ltr_finder (v1.0.6) [79] and RepeatModeler 2.0.5 [80]. After filtering and classification, the final de novo library was obtained, and RepeatMasker (v4.1.6) [81] was used with it to annotate repeat regions in the genome.
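The genome comparison described above chains several command-line tools, and only the syri and plotsr invocations are given in full. The following Python sketch assembles the chain as argument lists so the file hand-offs are explicit. The nucmer, delta-filter, and show-coords options are assumptions taken from common SyRI usage, not from this record, and mummerplot is omitted because its options are not stated. The syri and plotsr arguments are copied from the description.

```python
import shlex

# File names taken from the description.
REF = "NCBI.fasta"
QRY = "Salvia_divinorum.chr.fa"
PREFIX = "ref_qry"


def comparison_commands(ref=REF, qry=QRY, prefix=PREFIX):
    """Return (argv, stdout_file) pairs for the alignment and SV-calling steps."""
    delta = f"{prefix}.delta"
    filtered = f"{prefix}.delta.filter"
    coords = f"{filtered}.coords"
    return [
        # ASSUMPTION: alignment options are not given in the description.
        (["nucmer", "--maxmatch", "-c", "100", "-b", "500", "-l", "50",
          "-p", prefix, ref, qry], None),
        # ASSUMPTION: filtering thresholds are not given in the description.
        (["delta-filter", "-m", "-i", "90", "-l", "100", delta], filtered),
        # ASSUMPTION: output options are not given in the description.
        (["show-coords", "-THrd", filtered], coords),
        # Copied from the description.
        (["syri", "-c", coords, "-d", filtered, "-r", ref, "-q", qry], None),
        # Copied from the description.
        (["plotsr", "--sr", "syri.out", "--genomes", "genomes.txt",
          "-H", "8", "-W", "5", "-b", "pdf"], None),
    ]


def render(step):
    """Format one (argv, stdout_file) pair as a shell command line."""
    argv, stdout_file = step
    line = shlex.join(argv)
    if stdout_file is not None:
        line += f" > {shlex.quote(stdout_file)}"
    return line


if __name__ == "__main__":
    for step in comparison_commands():
        print(render(step))
```

Running the script only prints the command lines; it does not execute any tool, so the commands can be reviewed before use.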

Gene prediction and annotation

Gene prediction was performed using MAKER (v3.01.03) [82]. In the first round, transcriptome and protein data were used for gene prediction, with the MAKER parameters est2genome=1 and protein2genome=1. From these predictions, 2,000 structurally complete genes (with start and stop codons, and no premature termination or frameshifts) were randomly selected. In the second round, Augustus and SNAP were trained on these 2,000 structurally complete genes. The transcriptome prediction results, protein results, and two de novo prediction results were then integrated to obtain higher-quality gene models.

The datasets used for MAKER annotation in this project included transcriptome data and protein data from homologous species (Arabidopsis thaliana, Salvia divinorum, Salvia miltiorrhiza, Sesamum indicum, Salvia hispanica, Salvia splendens, Solanum lycopersicum). The de novo training software was Augustus (v3.5.0) [83] and SNAP (20060728) [84]. Transcriptome data were first aligned using hisat2 (v2.2.1) [85], and transcripts were then constructed using StringTie (v2.2.1) [86]. Finally, MAKER was used to integrate the results and obtain the final gene set.

For functional annotation, the gene set obtained from the gene structure annotation was compared with known protein databases and other libraries using the alignment software BLAST (v2.2.31) [87]. The databases used were SwissProt, TrEMBL, and NR. InterProScan (v5.71-102.0) [88] was used to search secondary structure domain databases for information on gene function, including SUPERFAMILY, NCBIFAM, PRINTS, Pfam, SMART, ProSiteProfiles, and ProSitePatterns. KEGG pathway annotation was performed using eggnog-mapper (v2.1.12) [89] with the database emapperdb (v5.0.2).
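The selection of structurally complete genes for training can be illustrated with a short sketch. The description does not say how the authors implemented this filter, so the function names, the ATG-only start codon, the standard stop codon set, and the fixed random seed below are all assumptions made for illustration.

```python
import random

# ASSUMPTION: standard nuclear genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_structurally_complete(cds: str) -> bool:
    """Check a coding sequence for start codon, stop codon, and intact frame."""
    cds = cds.upper()
    # A length that is not a multiple of 3 indicates a frameshift or truncation.
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # ASSUMPTION: only ATG is accepted as a start codon.
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Any in-frame stop before the last codon is a premature termination.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def sample_training_genes(cds_by_gene, n=2000, seed=0):
    """Randomly pick up to n complete genes (hypothetical helper)."""
    complete = sorted(
        gene for gene, cds in cds_by_gene.items()
        if is_structurally_complete(cds)
    )
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return rng.sample(complete, min(n, len(complete)))


if __name__ == "__main__":
    toy = {
        "geneA": "ATGAAATAA",     # complete
        "geneB": "ATGTAAAAATAA",  # premature stop
        "geneC": "ATGAAATA",      # frame broken
    }
    print(sample_training_genes(toy, n=2000))
```

Sorting the gene identifiers before sampling keeps the result independent of dictionary order, which matters when the same training set has to be regenerated later.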

Annotation of non-coding RNA

Non-coding RNAs (ncRNAs) are not translated into proteins; they include rRNA, tRNA, snRNA, and miRNA, and have important biological functions. tRNAscan-SE (v2.0, http://lowelab.ucsc.edu/tRNAscan-SE/) [90] was used to identify tRNA sequences in the genome based on their structural features. Because rRNA is highly conserved, rRNA sequences from closely related species were used as reference sequences, and BLASTN [87] was used to identify rRNA sequences in the genome. miRNA and snRNA sequences in the genome were predicted using the covariance models of the Rfam database (v14.10) [91] together with INFERNAL (v1.1.5, http://infernal.janelia.org/) [92].
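Covariance model searches with INFERNAL are usually post-processed from the tabular output. The sketch below parses such a table into strand-aware genomic intervals. It assumes the default cmscan `--tblout` column layout, in which the target is the Rfam family and the query is the genome sequence. The description does not say which INFERNAL program or output format the authors used, and the example hit is invented for illustration.

```python
def parse_tblout(lines):
    """Parse INFERNAL tabular hits into dictionaries.

    ASSUMPTION: default cmscan --tblout layout with 18 whitespace-separated
    columns, where the last column (description) may contain spaces.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        fields = line.split(None, 17)
        if len(fields) < 17:
            continue  # not a complete hit line
        seq_from, seq_to = int(fields[7]), int(fields[8])
        hits.append({
            "family": fields[0],
            "rfam_acc": fields[1],
            "seqid": fields[2],
            # On the minus strand seq_from is larger than seq_to,
            # so coordinates are normalized to start <= end.
            "start": min(seq_from, seq_to),
            "end": max(seq_from, seq_to),
            "strand": fields[9],
            "score": float(fields[14]),
            "evalue": float(fields[15]),
            "significant": fields[16] == "!",
            "description": fields[17].strip() if len(fields) > 17 else "",
        })
    return hits


if __name__ == "__main__":
    # Invented example hit; family name and accession are placeholders.
    example = [
        "#target name  accession  query name ...",
        "example_miRNA RF00000 chr1 - cm 1 130 5000 4871 - no 1 0.45 0.0 "
        "85.3 1.2e-15 ! hypothetical family description",
    ]
    for hit in parse_tblout(example):
        print(hit["seqid"], hit["start"], hit["end"], hit["strand"], hit["family"])
```

Normalizing the coordinates at parse time makes the records straightforward to write out as GFF3 or BED, both of which require the start to be no greater than the end.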
Created: 2026-03-05