
SARS-CoV-2 genome alignments

Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://zenodo.org/record/4694857
Description:
These four files include 27,851 SARS-CoV-2 full genomic sequences that were uploaded to GenBank between January 2020 and January 2021. Sequences were retrieved on four separate occasions, as reflected in the file names. The last download date was January 19, 2021, so the files should include every full-length sequence on GenBank up to that point. However, the sequence list was filtered to remove duplicates, sequences with too many non-canonical nucleotide calls (e.g., N, R, Y), and sequences that were truncated on either end beyond a certain minimum length cutoff. In the coding regions, sequences were aligned codon-by-codon to the amino acid string. The first "sequence" in each alignment file is the amino acid translation of the reference strain. In this string, asterisks denote non-coding nucleotides, each amino acid is represented by its single-letter code repeated three times (e.g., methionine = MMM), and stop codons are denoted by XXX. Gaps caused by insertions among the included sequences are denoted by "-" and are carried over into the amino acid string. The first real sequence is the reference strain (NC_045512.2, aka "Wuhan-1"). The sequences were downloaded and processed in batches: spring 2020, summer 2020, fall 2020, and winter 2021. Thus, the sequence files and the sequences contained therein are roughly in chronological order, but this order reflects the date of upload to GenBank, not the infection date or sequencing date. Each file is locally degapped, so alignment columns do not necessarily correspond across files. You will have to handle this yourself if you wish to combine files.
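The layout described above (an amino acid annotation string first, then the reference strain, then the remaining genomes) can be read with a few lines of Python. The following is a minimal sketch, not the authors' tooling. It assumes the alignments are plain FASTA text, which the description does not state, and the function names `read_fasta` and `codon_columns` and the toy records are hypothetical. It also assumes every coding letter in the annotation string belongs to a clean run of three identical letters, which may not hold where reading frames overlap.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def codon_columns(aa_track: str) -> List[Tuple[int, str]]:
    """Return (first_column, amino_acid) for each codon in the annotation string.

    '*' (non-coding) and '-' (insertion gap) columns are skipped. Letters are
    collected three at a time, so consecutive codons for the same amino acid
    (e.g. 'MMMMMM') are counted as two codons.
    """
    codons: List[Tuple[int, str]] = []
    pending: List[Tuple[int, str]] = []  # (column, letter) of the codon in progress
    for col, ch in enumerate(aa_track):
        if ch in "*-":
            continue
        pending.append((col, ch))
        if len(pending) == 3:
            if len({letter for _, letter in pending}) != 1:
                raise ValueError(f"mixed letters in codon starting at column {pending[0][0]}")
            codons.append(pending[0])
            pending = []
    if pending:
        raise ValueError("annotation string ends with an incomplete codon")
    return codons


# Toy alignment in the described layout (made-up sequences, not real data).
sample = [
    ">AA_annotation",
    "***MMM---KKKXXX**",
    ">NC_045512.2",
    "ACGATG---AAATAAGG",
    ">example_genome",
    "ACGATGCCCAAATAAGG",
]

records = list(read_fasta(sample))
aa_track = records[0][1]       # first record: amino acid annotation string
reference = records[1][1]      # second record: reference strain
genomes = records[2:]          # everything after that: aligned genomes

for start, aa in codon_columns(aa_track):
    print(start, aa, reference[start:start + 3])
```

Because the annotation string shares column coordinates with every sequence in the same file, the column indices returned here can be used to slice any genome in that file. They are not valid across files, since each file is degapped separately.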
Created:
2021-04-16