
1. Ecological genomics of the Northern krill: Genome assembly DNA sequences

Source: DataCite Commons. Updated: 2025-01-15. Indexed: 2024-07-13.
Download link:
https://figshare.scilifelab.se/articles/dataset/1_Ecological_genomics_of_the_Northern_krill_Genome_assembly_DNA_sequences/22785269/1
Description:
This item holds genome assembly reference sequences, i.e. the main output from the genome assembly.

<strong>Contents:</strong>

- <strong>northern_krill.genome_assembly.tar.gz</strong>, the major gzipped tar archive that contains seven DNA sequence files in FASTA format. These files represent the finished genome assembly of the Northern krill, produced using the "K20" reference specimen.
- <strong>non_reference_preliminary_mitochondrial_sequence.tar.gz</strong>, a minor file with resources used to assemble a preliminary mitochondrial sequence from a non-reference specimen ("K4").
- <strong>README.genecovr_instructions.txt</strong>, a text file with instructions for using genecovr and GMAP to evaluate the quality of the resulting genome assembly using RNA transcript sequences.

<strong>northern_krill.genome_assembly.tar.gz</strong>

This archive contains the final genome assembly sequences. <strong>The most important of these files is "1.m_norvegica.main_w_mito.fasta"</strong>, which is the main nuclear genome assembly plus the mitochondrial sequence. This is the primary genome assembly resource and was annotated for genes, repeats and DNA methylation. Genome-scale patterns of genetic variation among individuals and populations were measured using this resource as the reference.

<strong>Archived contents:</strong>

- <strong>1.m_norvegica.main_w_mito.fasta</strong>, main genome assembly including mitochondrial chromosome, 216568 sequences, 19.7 Gb.
- <strong>2.m_norvegica.short.fasta</strong>, very short assembly fragments below 200 bp, 228 sequences, 27 kb.
- <strong>3.m_norvegica.artefacts.fasta</strong>, sequences flagged as artefacts by Purge_Haplotigs due to very low or high mapping depths, 49868 sequences, 556 Mb.
- <strong>4.m_norvegica.haplotigs.fasta</strong>, sequences flagged as putative haplotigs by Purge_Haplotigs due to depth similarity to other sequences, 168305 sequences, 1.95 Gb.
- <strong>5.m_norvegica.mitochondrion.fasta</strong>, mitochondrial sequence, 1 sequence, 17944 bp.
- <strong>6.m_norvegica.mitochondrial_artefacts.fasta</strong>, mitochondrial-like sequences and potential assembly artefacts, 8 sequences, 81.6 kb.
- <strong>7.m_norvegica.bacterial.fasta</strong>, putative sequences from bacterial contaminants, 113 sequences, 8.7 Mb.

Sequence naming follows this convention:

- <strong>seq_s_X</strong>: "s" indicates this sequence is a scaffold of contigs and contains gaps ("N")
- <strong>seq_c_X</strong>: "c" indicates this sequence is a contig and contains no gaps
- <strong>seq_a_X</strong>: indicates this is an "artifact" (see above)
- <strong>seq_h_X</strong>: indicates this is a "haplotig" (see above)
- <strong>seq_m_X</strong>: indicates this is the mitochondrial sequence (see above)
- <strong>seq_r_X</strong>: indicates this is a "mitochondrial artifact" sequence (see above)
- <strong>seq_b_X</strong>: indicates this is a "bacterial" sequence (see above)

Each FASTA file has two accessory tab-separated spreadsheet files:

<strong>FASTA</strong>.lengths.csv: contains information about the order and lengths of sequences. Fields:

- File = name of file
- Sequence = name of sequence
- Length = length of sequence
- Length no N = length of sequence, not counting N
- Start = counting incrementally across the whole file, this is the start position of the sequence
- End = counting incrementally across the whole file, this is the stop position of the sequence
- Total = counting incrementally across the whole file, the total amount of sequence seen at this stage

<strong>FASTA</strong>.Nx_stats_1.csv: contains information about the length distribution of sequences, for example the N50. The sequences were sorted by length in order to produce these statistics. Fields:

- x = the N-level, from 1 to 100
- Nx = the N-level, from 1 to 100 and written as N1 to N100
- LENGTH[Nx] = the length of the shortest sequence at this level
- n[Lx] = the number of sequences at this level
- SUM = counting incrementally across N-levels, the sum of the sequence lengths at this level
- BIN_SUM = the sum of sequence lengths for this particular level/bin

Four additional statistics are printed as keys and values on the first line:

- TOT = total length of sequences in the file
- SEQS = number of sequences in the file
- MAX = the length of the longest sequence
- MEAN = the mean length of the sequences

<strong>non_reference_preliminary_mitochondrial_sequence.tar.gz</strong>

An archive with the Nanopore long reads (FASTQ format) used to produce a preliminary mitochondrial assembly from a non-reference specimen ("K4"), as well as the resulting sequence (FASTA format). In addition, the archive contains the MITOS2 gene annotations for this preliminary assembly.
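To make the Nx statistics (for example the N50) more concrete, the following is a minimal sketch of how such values can be derived from a list of sequence lengths. It uses the common definition of Nx: after sorting sequences from longest to shortest, LENGTH[Nx] is the length of the sequence at which the running sum first reaches x% of the total. The function name `nx_stats` and the toy lengths are hypothetical, and the tie and threshold handling is an assumption; the dataset's own script may differ in detail.

```python
from typing import Dict, Iterable, List, Tuple


def nx_stats(
    lengths: Iterable[int], levels: Iterable[int] = range(1, 101)
) -> Dict[int, Tuple[int, int, int]]:
    """Return {x: (LENGTH[Nx], n[Lx], SUM)} for each requested N-level x."""
    ordered: List[int] = sorted(lengths, reverse=True)  # longest first
    total = sum(ordered)
    stats: Dict[int, Tuple[int, int, int]] = {}
    if total == 0:
        return stats

    pending = sorted(levels)  # N-levels still waiting to be reached
    i = 0
    cumulative = 0
    for count, length in enumerate(ordered, start=1):
        cumulative += length
        # Integer comparison avoids floating-point threshold issues:
        # cumulative / total >= x / 100  <=>  cumulative * 100 >= x * total
        while i < len(pending) and cumulative * 100 >= pending[i] * total:
            stats[pending[i]] = (length, count, cumulative)
            i += 1
    return stats


# Toy example (hypothetical lengths, total = 300)
toy = [80, 70, 50, 40, 30, 20, 10]
result = nx_stats(toy, levels=[50, 90])
print(result[50])  # (70, 2, 150)
print(result[90])  # (30, 5, 270)
```

In this toy case the two longest sequences (80 + 70 = 150) already cover half of the total, so the N50 is 70 and two sequences are counted at that level.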
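The sequence naming convention described above (seq_s_X, seq_c_X, and so on) lends itself to a simple lookup by prefix letter. The sketch below classifies a FASTA header by that letter. The helper name `classify` and the example headers are hypothetical, and the description does not specify what form "X" takes, so the pattern accepts any non-whitespace identifier there as an assumption.

```python
import re
from typing import Optional

# Prefix letter -> category, following the naming convention in the description.
CATEGORIES = {
    "s": "scaffold (contains gaps, 'N')",
    "c": "contig (no gaps)",
    "a": "artifact",
    "h": "haplotig",
    "m": "mitochondrial sequence",
    "r": "mitochondrial artifact",
    "b": "bacterial",
}

# Assumption: "X" is any run of non-whitespace characters.
_NAME_RE = re.compile(r"^seq_([scahmrb])_(\S+)$")


def classify(header: str) -> Optional[str]:
    """Return the category for a FASTA header or sequence name, or None if it does not match."""
    fields = header.lstrip(">").split()
    if not fields:
        return None
    match = _NAME_RE.match(fields[0])
    return CATEGORIES[match.group(1)] if match else None


# Hypothetical headers for illustration
print(classify(">seq_s_1"))   # scaffold (contains gaps, 'N')
print(classify("seq_h_42"))   # haplotig
print(classify(">chr1"))      # None
```

Returning `None` for unrecognized names, rather than raising, keeps the helper usable when scanning files that mix in sequences from other sources.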
Provider:
Uppsala University
Created:
2024-03-27