1. Ecological genomics of the Northern krill: Genome assembly DNA sequences
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://figshare.com/articles/dataset/1_Ecological_genomics_of_the_Northern_krill_Genome_assembly_DNA_sequences/22785269
Description:
Files in this item:
northern_krill.genome_assembly.tar.gz: the main gzipped tar archive, containing seven DNA sequence files in FASTA format. These files represent the finished genome assembly of the Northern krill, produced using the "K20" reference specimen.
non_reference_preliminary_mitochondrial_sequence.tar.gz: a minor archive with the resources used to assemble a preliminary mitochondrial sequence from a non-reference specimen ("K4").
README.genecovr_instructions.txt: a text file with instructions for using genecovr and GMAP to evaluate the quality of the genome assembly with RNA transcript sequences.

FASTA files in northern_krill.genome_assembly.tar.gz:
1.m_norvegica.main_w_mito.fasta: main genome assembly including the mitochondrial chromosome, 216568 sequences, 19.7 Gb.
2.m_norvegica.short.fasta: very short assembly fragments below 200 bp, 228 sequences, 27 kb.
3.m_norvegica.artefacts.fasta: sequences flagged as artefacts by Purge_Haplotigs due to very low or very high mapping depths, 49868 sequences, 556 Mb.
4.m_norvegica.haplotigs.fasta: sequences flagged as putative haplotigs by Purge_Haplotigs due to depth similarity to other sequences, 168305 sequences, 1.95 Gb.
5.m_norvegica.mitochondrion.fasta: mitochondrial sequence, 1 sequence, 17944 bp.
6.m_norvegica.mitochondrial_artefacts.fasta: mitochondrial-like sequences and potential assembly artefacts, 8 sequences, 81.6 kb.
7.m_norvegica.bacterial.fasta: putative sequences from bacterial contaminants, 113 sequences, 8.7 Mb.

Sequence name prefixes:
seq_s_X: "s" indicates that the sequence is a scaffold of contigs and contains gaps ("N").
seq_c_X: "c" indicates that the sequence is a contig and contains no gaps.
seq_a_X: indicates an "artefact" (see above).
seq_h_X: indicates a "haplotig" (see above).
seq_m_X: indicates the mitochondrial sequence (see above).
seq_r_X: indicates a "mitochondrial artefact" sequence (see above).
seq_b_X: indicates a "bacterial" sequence (see above).

Columns in FASTA.lengths.csv:
File = name of the file
Sequence = name of the sequence
Length = length of the sequence
Length no N = length of the sequence, not counting N
Start = start position of the sequence, counting incrementally across the whole file
End = stop position of the sequence, counting incrementally across the whole file
Total = total amount of sequence seen at this stage, counting incrementally across the whole file

Columns in FASTA.Nx_stats_1.csv:
x = the N-level, from 1 to 100
Nx = the N-level, from 1 to 100, written as N1 to N100
LENGTH[Nx] = the length of the shortest sequence at this level
n[Lx] = the number of sequences at this level
SUM = the sum of the sequence lengths at this level, counting incrementally across N-levels
BIN_SUM = the sum of sequence lengths for this particular level/bin

Statistics on the first line of FASTA.Nx_stats_1.csv:
TOT = total length of the sequences in the file
SEQS = number of sequences in the file
MAX = length of the longest sequence
MEAN = mean length of the sequences
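The columns of FASTA.lengths.csv described above can be illustrated with a minimal Python sketch. It is not the authors' script. It assumes that Start and End are 1-based, inclusive coordinates in a running count over the whole file, so that Total equals End for each row; the item does not state the exact convention. The names `LengthRow`, `read_fasta` and `length_rows` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class LengthRow(NamedTuple):
    file: str
    sequence: str
    length: int
    length_no_n: int
    start: int
    end: int
    total: int


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the first word of the header as the sequence name.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def length_rows(file_name: str,
                records: Iterable[Tuple[str, str]]) -> Iterator[LengthRow]:
    """Build one row per sequence, in file order."""
    total = 0
    for name, seq in records:
        length = len(seq)
        length_no_n = length - seq.upper().count("N")
        start = total + 1   # assumption: 1-based, inclusive coordinates
        total += length     # running amount of sequence seen so far
        yield LengthRow(file_name, name, length, length_no_n,
                        start, total, total)


demo = [">seq_s_1 example scaffold", "ACGTNN", "ACGT", ">seq_c_2", "ACG"]
rows = list(length_rows("demo.fasta", read_fasta(demo)))
for row in rows:
    print("\t".join(str(field) for field in row))
```

For a real file, `read_fasta` could be given an open file handle instead of the in-memory list used here.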
This item holds genome assembly reference sequences, i.e. the main output from the genome assembly.
Contents:
northern_krill.genome_assembly.tar.gz
This archive contains the final genome assembly sequences. The most important of these files is "1.m_norvegica.main_w_mito.fasta", which holds the main nuclear genome assembly plus the mitochondrial sequence. This is the primary genome assembly resource and was annotated for genes, repeats and DNA methylation. Genome-scale patterns of genetic variation among individuals and populations were measured using this resource as the reference.
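Archived contents: the seven FASTA files, 1.m_norvegica.main_w_mito.fasta through 7.m_norvegica.bacterial.fasta, listed in the description above.
Sequence names follow the convention seq_<type>_X, where the type letter (s, c, a, h, m, r or b) identifies the kind of sequence, as defined in the description above.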
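As a hedged illustration, the seq_<letter>_X naming scheme described for this item can be turned into a small lookup that reports the kind of sequence from its name. The format of the X part is not specified in the item, so this sketch assumes it is any run of non-whitespace characters. The names `SEQUENCE_TYPES` and `classify` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Type letters as described for this item.
SEQUENCE_TYPES = {
    "s": "scaffold",
    "c": "contig",
    "a": "artefact",
    "h": "haplotig",
    "m": "mitochondrial sequence",
    "r": "mitochondrial artefact",
    "b": "bacterial sequence",
}

# Assumption: X is any non-empty identifier without whitespace.
_NAME_PATTERN = re.compile(r"^seq_([scahmrb])_(\S+)$")


def classify(name: str) -> Optional[Tuple[str, str]]:
    """Return (sequence type, X) for a sequence name, or None if it does not match."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        return None
    return SEQUENCE_TYPES[match.group(1)], match.group(2)


print(classify("seq_s_12"))
print(classify("seq_h_7"))
print(classify("chr1"))
```

Names that do not follow the convention return `None` rather than raising, which makes the function convenient for filtering mixed inputs.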
Each FASTA file is accompanied by two tab-separated accessory spreadsheet files:
FASTA.lengths.csv: contains information about the order and lengths of sequences.
FASTA.Nx_stats_1.csv: contains info about the length distribution of sequences, for example the N50. The sequences were sorted by length in order to produce these statistics.
Four additional statistics (TOT, SEQS, MAX and MEAN) are printed as keys and values on the first line of this file.
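The Nx statistics can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script used to produce FASTA.Nx_stats_1.csv. It assumes that level x is reached once the longest sequences together cover at least x% of the total length, and that BIN_SUM is the amount of sequence added since the previous level. Neither convention is spelled out in the item, and the function name `nx_stats` is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# One row per level: (x, LENGTH[Nx], n[Lx], SUM, BIN_SUM)
NxRow = Tuple[int, int, int, int, int]


def nx_stats(lengths: Sequence[int]) -> Tuple[Dict[str, float], List[NxRow]]:
    """Return summary statistics and one row for each level N1 to N100."""
    ordered = sorted(lengths, reverse=True)  # longest sequence first
    tot = sum(ordered)
    if not ordered or tot == 0:
        raise ValueError("need at least one sequence of non-zero length")
    summary = {
        "TOT": tot,
        "SEQS": len(ordered),
        "MAX": ordered[0],
        "MEAN": tot / len(ordered),
    }
    rows: List[NxRow] = []
    running = 0    # cumulative length of the sequences taken so far
    count = 0      # number of sequences taken so far
    previous = 0   # SUM at the previous level
    for x in range(1, 101):
        # Integer comparison avoids floating-point rounding at the threshold.
        while running * 100 < x * tot:
            running += ordered[count]
            count += 1
        rows.append((x, ordered[count - 1], count, running, running - previous))
        previous = running
    return summary, rows


summary, rows = nx_stats([10, 8, 6, 4, 2])
n50 = rows[49]  # rows are ordered N1..N100, so N50 is at index 49
print(summary)
print("N50 =", n50[1], "L50 =", n50[2])
```

In the toy input the total length is 30, so half of it (15) is first covered by the two longest sequences (10 + 8 = 18), giving N50 = 8 and L50 = 2.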
non_reference_preliminary_mitochondrial_sequence.tar.gz
An archive with the Nanopore long reads (FASTQ format) used to produce a preliminary mitochondrial assembly from a non-reference specimen ("K4"), as well as the resulting sequence (FASTA format). In addition, the archive contains the MITOS2 gene annotations for this preliminary assembly.
Created:
2024-03-28



