Sequences and annotations of a provisional genome draft of a Senegalese sole female (Sosen1) and a male (Sse05_10M)

Figshare · Updated 2020-06-12 · Indexed 2026-04-28

Download link:

https://figshare.com/articles/dataset/Sequences_and_annotations_of_a_provisional_genome_draft_of_a_Senegalese_sole_female/12472100


Description:

Data as of 2018 for a female Senegalese sole genome (Sosen1), obtained by Nanopore sequencing.

1) Sosen1_genome_draft.zip. Unzip the archive to find:
• Sosen1_genome_scaffolds.fasta containing the identifier and sequence of every contig and scaffold, in fasta format.
• Sosen1_genome_annotation.gff3 containing a provisional annotation of the genome contigs and scaffolds in Sosen1_genome_scaffolds.fasta, produced with MAKER2 and the transcript sequences in SOLSEv5.0.
• Sosen1_maker.transcripts.fasta containing the transcripts deduced from the gff3 annotation file.
• Sosen1_maker.proteins.fasta containing the deduced amino acid sequences for all transcripts in Sosen1_maker.transcripts.fasta.
• Sosen1_maker.proteins_annotation.tsv containing a complete annotation of the transcripts and proteins in the two preceding files, performed with our software Full-LengtherNext. It includes transcript and protein lengths, best UniProtKB orthologue with identity % and E-value, structural status, open reading frame location in the transcript, description, GOs, KEGG codes, InterPro IDs, Pfam, EC and Unipathway, as tab-separated values (tsv format).

The Sosen1 (or SENf1A) female genome was reannotated in 2020.

2) Sosen1_female_reannotation_2020.zip. Once unzipped, it provides the following files:
• SENf1A.gff3.gz --> gff3 file with the protein-coding annotation
• SSENf1A.stats.txt.gz --> statistics of the protein-coding annotation
• SSENf1A.transcripts.fa.gz --> multifasta file with the protein-coding annotated transcripts
• SSENf1A.pep.fa.gz --> amino acid sequences of the annotated proteins
• SSENf1A.cds.fa.gz --> nucleotide sequences of the annotated proteins
• SSENf1A.longestpeptide.fa.gz --> amino acid sequence of the longest protein annotated for each gene
• SSENf1ncA.gff3.gz --> gff3 file with the non-coding annotation
• SSENf1ncA.transcripts.fa.gz --> multifasta file with the non-coding transcripts

Data as of 2020 for a male Senegalese sole genome, Sse05_10M (or Sosen2 or SSENm1B), obtained by hybrid sequencing and assembly.

3) Sosen2_male_genome_scaffolds.fasta contains the genome scaffolds.

4) Sosen2_annotations.zip contains the male genome integrated with genetic markers to provide linkage groups as chromosome surrogates, as well as gene annotations, in the following files:
• Male_LA_Total.fasta.gz --> male genome assembly
• SSENm1B.gff3.fz --> gff3 file with the protein-coding annotation
• SSENm1B.stats.txt.gz --> statistics of the protein-coding annotation
• SSENm1B.transcripts.fa.gz --> multifasta file with the protein-coding annotated transcripts
• SSENm1B.pep.fa.gz --> amino acid sequences of the annotated proteins
• SSENm1B.cds.fa.gz --> nucleotide sequences of the annotated proteins
• SSENm1B.longestpeptide.fa.gz --> amino acid sequence of the longest protein annotated for each gene
• SSENm1ncB.gff3.gz --> gff3 file with the non-coding annotation
• SSENm1ncB.transcripts.fa.gz --> multifasta file with the non-coding transcripts
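
As a quick sanity check after unzipping, one might tally the feature types in any of the gff3 annotation files listed above. The following is a minimal sketch using only the Python standard library. It assumes that the files follow the standard nine-column, tab-separated GFF3 layout and that the ".gz" files are ordinary gzip archives; the function names are hypothetical and are not part of the dataset or of any tool mentioned in it.

```python
import gzip
from collections import Counter


def count_gff3_feature_types(lines):
    """Count the values of column 3 (feature type) in an iterable of GFF3 lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # optional embedded sequence section; no features follow it
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # skip rows that do not have the nine GFF3 columns
        counts[fields[2]] += 1
    return counts


def count_gff3_file(path):
    """Open a plain or gzip-compressed GFF3 file and count its feature types."""
    # Assumption: a ".gz" suffix means gzip; anything else is read as plain text.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return count_gff3_feature_types(handle)


# Example call (the path is only an illustration of a file from the archive):
# for feature_type, n in count_gff3_file("SSENf1ncA.gff3.gz").most_common():
#     print(feature_type, n)
```

Comparing the resulting gene and mRNA counts with the figures in the corresponding stats.txt.gz file is one way to confirm that a download unpacked completely.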


Created:

2020-06-12