SARS-CoV-2 genome alignments
Source: NIAID Data Ecosystem. Indexed: 2026-03-12
Download link:
https://zenodo.org/record/4694857
Description:
These four files contain 27,851 full-length SARS-CoV-2 genomic sequences uploaded to GenBank between January 2020 and January 2021. Sequences were retrieved on four separate occasions, as reflected in the file names. The last download date was January 19, 2021, so the files should cover every full-length sequence on GenBank up to that point, with three exceptions: the sequence list was filtered to remove duplicates, sequences with too many non-canonical nucleotide calls (e.g., N, R, Y), and sequences truncated at either end beyond a certain minimum length cutoff.
In the coding regions, sequences were aligned codon-by-codon to the amino acid string. The first 'sequence' in each alignment file is not a genome but the amino acid translation of the reference strain. Within it, asterisks denote non-coding nucleotides. Each amino acid is represented by its single-letter code repeated three times (e.g., methionine = MMM), and stop codons are denoted by XXX. Gaps caused by insertions in any of the included sequences are denoted by "-" and are carried over into the amino acid (AA) row. The first real sequence is the reference strain (NC_045512.2, also known as "Wuhan-1").
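The layout described above can be read with a few lines of Python. The following is a minimal sketch, not the authors' tooling: it assumes the alignment files are FASTA-formatted (the description does not state the format), and the function names `read_fasta`, `codon_columns`, and `codons_for` are hypothetical. It uses the amino acid guide row to locate codon columns, then extracts the matching nucleotide triplets from any aligned sequence.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.
    Assumption: the alignment files are FASTA-formatted."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def codon_columns(guide):
    """Return [(amino_acid, (i, j, k)), ...] from the amino acid guide row.
    '*' (non-coding) and '-' (insertion gap) columns are skipped; the
    remaining columns are grouped in threes, and each group must carry
    the same letter three times (e.g. MMM, or XXX for a stop codon)."""
    triplets, pending = [], []
    for col, ch in enumerate(guide):
        if ch in "*-":
            continue
        pending.append(col)
        if len(pending) == 3:
            if len({guide[c] for c in pending}) != 1:
                raise ValueError(f"inconsistent codon label at columns {pending}")
            triplets.append((guide[pending[0]], tuple(pending)))
            pending = []
    if pending:
        raise ValueError("guide row ends with an incomplete codon")
    return triplets


def codons_for(row, triplets):
    """Pair each guide amino acid with the nucleotides found in one aligned row."""
    return [(aa, "".join(row[c] for c in cols)) for aa, cols in triplets]


# Toy example (made-up data, not taken from the dataset).
guide = "**MMM---KKKXXX*"
ref = "ACATG---AAATAAC"
print(codons_for(ref, codon_columns(guide)))
```

With real files, the first record returned by `read_fasta` would serve as `guide`, and every later record can be passed to `codons_for` with the same triplet list.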
The sequences were downloaded and processed in four batches: spring 2020, summer 2020, fall 2020, and winter 2021. The files, and the sequences within them, are therefore in roughly chronological order, but that order reflects the date of upload to GenBank, not the date of infection or sequencing. Each file is degapped locally, that is, independently of the others, so alignment columns do not necessarily correspond across files. If you wish to combine files, you will have to reconcile the gaps yourself.
Created:
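One way to combine two files is to use a row they share as an anchor and pad each file with the insertion columns that appear only in the other. The sketch below rests on assumptions: it takes "locally degapped" to mean that insertion columns differ between files, it expects both alignments to start with the same anchor row (for example the reference strain, or the amino acid guide row) containing "-" only at insertion columns, and the names `split_by_anchor` and `merge_alignments` are hypothetical. Inserted bases from different files are padded side by side, not realigned against each other.

```python
def split_by_anchor(anchor_row):
    """Group column indices into blocks. Each block holds the insertion
    columns preceding one anchor character, followed by that character's
    column. A final block holds trailing insertion columns (may be empty)."""
    blocks, current = [], []
    for col, ch in enumerate(anchor_row):
        current.append(col)
        if ch != "-":
            blocks.append(current)
            current = []
    blocks.append(current)
    return blocks


def merge_alignments(aln_a, aln_b):
    """Merge two alignments given as lists of (name, row) pairs.
    Row 0 of each list is the shared anchor; the copy in aln_b is dropped."""
    blocks_a = split_by_anchor(aln_a[0][1])
    blocks_b = split_by_anchor(aln_b[0][1])
    if len(blocks_a) != len(blocks_b):
        raise ValueError("anchor rows differ in ungapped length")

    def expand(row, mine_blocks, other_blocks):
        out, last = [], len(mine_blocks) - 1
        for i, (mine, other) in enumerate(zip(mine_blocks, other_blocks)):
            has_base = 0 if i == last else 1
            n_mine = len(mine) - has_base      # insertion columns in this file
            n_other = len(other) - has_base    # insertion columns in the other file
            width = max(n_mine, n_other)
            inserted = "".join(row[c] for c in mine[:n_mine])
            out.append(inserted + "-" * (width - n_mine))
            if has_base:
                out.append(row[mine[-1]])
        return "".join(out)

    merged = [(name, expand(row, blocks_a, blocks_b)) for name, row in aln_a]
    merged += [(name, expand(row, blocks_b, blocks_a)) for name, row in aln_b[1:]]
    return merged


# Toy example (made-up data, not taken from the dataset).
a = [("ref", "AC-GT"), ("a1", "ACTGT")]
b = [("ref", "A--CGT"), ("b1", "AGGCGT")]
for name, row in merge_alignments(a, b):
    print(name, row)
```

If the files carry both the amino acid guide row and the reference strain at the top, the duplicate copies from the second file would both need to be dropped; this sketch drops only the anchor.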
2021-04-16



