1. Ecological genomics of the Northern krill: Genome assembly DNA sequences
Source: Figshare. Updated 2024-03-28; indexed 2026-04-28.
Download link:
https://figshare.com/articles/dataset/1_Ecological_genomics_of_the_Northern_krill_Genome_assembly_DNA_sequences/22785269
Resource description:
This item holds genome assembly reference sequences, i.e. the main output from the genome assembly.

Contents:

- `northern_krill.genome_assembly.tar.gz`, the major gzipped tar archive that contains seven DNA sequence files in FASTA format. These files represent the finished genome assembly of the Northern krill, produced using the "K20" reference specimen.
- `non_reference_preliminary_mitochondrial_sequence.tar.gz`, a minor file with resources used to assemble a preliminary mitochondrial sequence from a non-reference specimen ("K4").
- `README.genecovr_instructions.txt`, a text file with instructions for using genecovr and GMAP to evaluate the quality of the resulting genome assembly using RNA transcript sequences.

`northern_krill.genome_assembly.tar.gz`

This archive contains the final genome assembly sequences. The most important file is `1.m_norvegica.main_w_mito.fasta`, which is the main nuclear genome assembly plus the mitochondrial sequence. This is the primary genome assembly resource and was annotated for genes, repeats and DNA methylation. Genome-scale patterns of genetic variation among individuals and populations were measured using this resource as the reference.

Archived contents:

- `1.m_norvegica.main_w_mito.fasta`, main genome assembly including mitochondrial chromosome, 216568 sequences, 19.7 Gb.
- `2.m_norvegica.short.fasta`, very short assembly fragments below 200 bp, 228 sequences, 27 kb.
- `3.m_norvegica.artefacts.fasta`, sequences flagged as artefacts by Purge_Haplotigs due to very low or high mapping depths, 49868 sequences, 556 Mb.
- `4.m_norvegica.haplotigs.fasta`, sequences flagged as putative haplotigs by Purge_Haplotigs due to depth similarity to other sequences, 168305 sequences, 1.95 Gb.
- `5.m_norvegica.mitochondrion.fasta`, mitochondrial sequence, 1 sequence, 17944 bp.
- `6.m_norvegica.mitochondrial_artefacts.fasta`, mitochondrial-like sequence and potential assembly artefacts, 8 sequences, 81.6 kb.
- `7.m_norvegica.bacterial.fasta`, putative sequences from bacterial contaminants, 113 sequences, 8.7 Mb.

Sequence naming follows this convention:

- `seq_s_X`: "s" indicates this sequence is a scaffold of contigs and contains gaps ("N")
- `seq_c_X`: "c" indicates this sequence is a contig and contains no gaps
- `seq_a_X`: indicates this is an "artefact" (see above)
- `seq_h_X`: indicates this is a "haplotig" (see above)
- `seq_m_X`: indicates this is the mitochondrial sequence (see above)
- `seq_r_X`: indicates this is a "mitochondrial artefact" sequence (see above)
- `seq_b_X`: indicates this is a "bacterial" sequence (see above)

Each FASTA file is accompanied by two tab-separated accessory spreadsheet files:

`FASTA.lengths.csv`: contains information about the order and lengths of sequences.

- File = name of file
- Sequence = name of sequence
- Length = length of sequence
- Length no N = length of sequence, not counting N
- Start = counting incrementally across the whole file, this is the start position of the sequence
- End = counting incrementally across the whole file, this is the stop position of the sequence
- Total = counting incrementally across the whole file, the total amount of sequence seen at this stage

`FASTA.Nx_stats_1.csv`: contains information about the length distribution of sequences, for example the N50. The sequences were sorted by length in order to produce these statistics.

- x = the N-level, from 1 to 100
- Nx = the N-level, from 1 to 100 and written as N1 to N100
- LENGTH[Nx] = the length of the shortest sequence at this level
- n[Lx] = the number of sequences at this level
- SUM = counting incrementally across N-levels, the sum of the sequence lengths at this level
- BIN_SUM = the sum of sequence lengths for this particular level/bin

Four additional statistics are printed as keys and values on the first line:

- TOT = total length of sequences in the file
- SEQS = number of sequences in the file
- MAX = the length of the longest sequence
- MEAN = the mean length of the sequences

`non_reference_preliminary_mitochondrial_sequence.tar.gz`

An archive with the Nanopore long reads (FASTQ format) used to produce a preliminary mitochondrial assembly from a non-reference specimen ("K4"), as well as the resulting sequence (FASTA format). In addition, the archive contains the MITOS2 gene annotations for this preliminary assembly.
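This dataset contains the following core files:
1. **northern_krill.genome_assembly.tar.gz**: the main gzipped tar archive, containing seven DNA sequence files in FASTA format. Together they represent the finished genome assembly of the Northern krill, built from the "K20" reference specimen.
2. **non_reference_preliminary_mitochondrial_sequence.tar.gz**: a minor resource file with the data used to assemble a preliminary mitochondrial sequence from the non-reference specimen "K4".
3. **README.genecovr_instructions.txt**: a plain-text guide to using genecovr and GMAP to evaluate the quality of the genome assembly against RNA transcript sequences.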
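### The seven FASTA sequence files in the archive
1. `1.m_norvegica.main_w_mito.fasta`: the main genome assembly including the mitochondrial chromosome; 216568 sequences, 19.7 Gb in total.
2. `2.m_norvegica.short.fasta`: very short assembly fragments below 200 bp; 228 sequences, 27 kb in total.
3. `3.m_norvegica.artefacts.fasta`: sequences flagged as artefacts by Purge_Haplotigs because of abnormal (very low or very high) mapping depth; 49868 sequences, 556 Mb in total.
4. `4.m_norvegica.haplotigs.fasta`: sequences flagged as putative haplotigs by Purge_Haplotigs because their depth is similar to that of other sequences; 168305 sequences, 1.95 Gb in total.
5. `5.m_norvegica.mitochondrion.fasta`: the mitochondrial sequence; 1 sequence, 17944 bp.
6. `6.m_norvegica.mitochondrial_artefacts.fasta`: mitochondrial-like sequences and potential assembly artefacts; 8 sequences, 81.6 kb in total.
7. `7.m_norvegica.bacterial.fasta`: putative sequences from bacterial contaminants; 113 sequences, 8.7 Mb in total.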
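
Before extracting a multi-gigabyte archive, it can help to check which FASTA files it actually contains. The following is a minimal sketch using Python's standard `tarfile` module; the function name `list_fasta_members` is hypothetical, and the internal directory layout of the archive is not documented here, so the code simply matches on the `.fasta` suffix.

```python
import tarfile


def list_fasta_members(archive_path):
    """Return the sorted names of regular files ending in '.fasta' in a .tar.gz archive.

    Assumption: FASTA files carry the '.fasta' suffix, as in the file list above.
    Accessory files such as '*.fasta.lengths.csv' are therefore excluded.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith(".fasta")
        )


# Example call (path is illustrative; adjust to where the download was saved):
# for name in list_fasta_members("northern_krill.genome_assembly.tar.gz"):
#     print(name)
```

Note that listing members of a gzipped tar requires reading through the compressed stream, so this can take a while on an archive of this size.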
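### Sequence naming convention
- `seq_s_X`: "s" indicates that the sequence is a scaffold built from contigs and contains gap bases ("N")
- `seq_c_X`: "c" indicates that the sequence is a contig and contains no gaps
- `seq_a_X`: indicates an "artefact" sequence (see above)
- `seq_h_X`: indicates a "haplotig" sequence (see above)
- `seq_m_X`: indicates the mitochondrial sequence (see above)
- `seq_r_X`: indicates a "mitochondrial artefact" sequence (see above)
- `seq_b_X`: indicates a "bacterial" sequence (see above)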
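
The naming convention above lends itself to a small lookup. This is an illustrative sketch only: the helper `classify_sequence_name` and the `SEQ_TYPES` table are hypothetical names, and the exact format of the `X` part is not specified in the description, so it is treated here as an opaque identifier without whitespace.

```python
import re

# Letter codes taken from the naming convention; labels are free-text descriptions.
SEQ_TYPES = {
    "s": "scaffold",
    "c": "contig",
    "a": "artefact",
    "h": "haplotig",
    "m": "mitochondrion",
    "r": "mitochondrial artefact",
    "b": "bacterial",
}

# Assumption: names look like 'seq_<letter>_<X>' where X contains no whitespace.
_NAME_PATTERN = re.compile(r"^seq_([a-z])_(\S+)$")


def classify_sequence_name(name):
    """Split a sequence name into (type label, identifier) or raise ValueError."""
    match = _NAME_PATTERN.match(name)
    if match is None or match.group(1) not in SEQ_TYPES:
        raise ValueError(f"unrecognised sequence name: {name!r}")
    return SEQ_TYPES[match.group(1)], match.group(2)


print(classify_sequence_name("seq_s_12"))
```

Raising on unknown codes, rather than returning a default, makes it obvious when a file contains names outside the documented convention.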
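### Statistics field descriptions
- `File`: name of the file
- `Sequence`: name of the sequence
- `Length`: total length of the sequence
- `Length no N`: length of the sequence, not counting gap bases ("N")
- `Start`: start position of the sequence, counting incrementally across the whole file
- `End`: end position of the sequence, counting incrementally across the whole file
- `Total`: cumulative amount of sequence seen at this point in the file
- `x`: the N-level, from 1 to 100
- `Nx`: the N-level, from 1 to 100, written as `N1` to `N100`
- `LENGTH[Nx]`: length of the shortest sequence at this level
- `n[Lx]`: number of sequences at this level
- `SUM`: sum of sequence lengths at this level, accumulated across N-levels
- `BIN_SUM`: sum of sequence lengths within this particular level/bin
- `TOT`: total length of all sequences in the file
- `SEQS`: number of sequences in the file
- `MAX`: length of the longest sequence
- `MEAN`: mean length of the sequences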
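
To make the Nx fields concrete, here is a minimal sketch of how N-level statistics such as N50 are commonly computed from a list of sequence lengths. It is not the script used to produce the dataset's files: the function name `nx_stats`, the output keys, and the tie-breaking rule (a level is reached once the cumulative length is at least x% of the total) are assumptions.

```python
def nx_stats(lengths, levels=range(1, 101)):
    """Compute Nx-style statistics for ascending N-levels given in percent.

    For each level x, sequences are taken from longest to shortest until their
    cumulative length reaches at least x% of the total. The row then holds:
      'length': the shortest sequence taken so far (cf. LENGTH[Nx])
      'count' : how many sequences were taken (cf. n[Lx])
      'sum'   : their cumulative length (cf. SUM)
    Assumption: all lengths are positive and levels are in ascending order.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered or ordered[-1] <= 0:
        raise ValueError("lengths must be a non-empty list of positive integers")
    total = sum(ordered)
    rows = []
    running = 0  # cumulative length of the sequences taken so far
    taken = 0    # number of sequences taken so far
    for x in levels:
        # Integer comparison avoids floating-point rounding at the threshold.
        while taken < len(ordered) and running * 100 < x * total:
            running += ordered[taken]
            taken += 1
        rows.append({"x": x, "length": ordered[taken - 1], "count": taken, "sum": running})
    return rows


stats = nx_stats([2, 3, 4, 5, 6, 10])
n50 = stats[49]  # levels start at 1, so index 49 is x = 50
print(n50["length"], n50["count"])
```

In this toy input the total is 30, so the 50% threshold of 15 is first reached after the two longest sequences (10 + 6 = 16), giving an N50 of 6 with two sequences.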
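### Additional notes on the main archive
The core of this dataset is the genome assembly reference sequences, i.e. the main output of the genome assembly. The archive `northern_krill.genome_assembly.tar.gz` contains the final assembly sequences; the most important file is `1.m_norvegica.main_w_mito.fasta`, the main nuclear genome assembly plus the mitochondrial sequence. This is the primary genome assembly resource: it was annotated for genes, repeats and DNA methylation, and it served as the reference for measuring genome-scale patterns of genetic variation among individuals and populations.
Each FASTA file is accompanied by two tab-separated accessory spreadsheet files:
1. `FASTA.lengths.csv`: contains information about the order and lengths of the sequences.
2. `FASTA.Nx_stats_1.csv`: contains statistics on the length distribution of the sequences (for example the N50). The sequences were sorted by length to produce these statistics. Four additional statistics are printed as keys and values on the first line.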
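
The per-sequence fields described for `FASTA.lengths.csv` (`Length`, `Length no N`, `Start`, `End`, `Total`) can be reproduced from FASTA text with a few lines of Python. This is a sketch under stated assumptions, not the authors' script: the helper names are hypothetical, `Start` is assumed to be 1-based, and `End` is assumed to be inclusive, which the description does not specify.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) tuples."""
    records = []
    name, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(parts)))
            header = line[1:].split()
            name, parts = (header[0] if header else ""), []
        elif name is not None:
            parts.append(line)
    if name is not None:
        records.append((name, "".join(parts)))
    return records


def length_table(text, file_name):
    """Build one row per sequence with the fields described for the lengths file.

    Assumptions: Start is 1-based and End is inclusive, both counted
    cumulatively across the whole file in file order.
    """
    rows, total = [], 0
    for name, seq in read_fasta(text):
        length = len(seq)
        start = total + 1
        total += length
        rows.append({
            "File": file_name,
            "Sequence": name,
            "Length": length,
            "Length no N": length - seq.upper().count("N"),
            "Start": start,
            "End": total,
            "Total": total,
        })
    return rows


example = ">seq_s_1\nACGTNNAC\nGT\n>seq_c_2\nACG\n"
for row in length_table(example, "example.fasta"):
    print(row)
```

For the real 19.7 Gb file, a streaming parser that does not hold whole sequences in memory would be a better fit; the version above favours brevity.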
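### Notes on the non-reference preliminary mitochondrial sequence archive
The archive `non_reference_preliminary_mitochondrial_sequence.tar.gz` contains the Nanopore long reads (FASTQ format) used to assemble a preliminary mitochondrial sequence from the non-reference specimen "K4", together with the resulting sequence (FASTA format). It also contains the MITOS2 gene annotations for this preliminary assembly.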
Created: 2024-03-28



