Initial consensus sequences
Source: DataCite Commons | Updated: 2022-04-14 | Indexed: 2024-07-29
Download link:
https://figshare.com/articles/dataset/Initial_consensus_sequences/19596589
Description:
We first prepared a query library by clustering the 165 reverse transcriptase (RT) amino acid sequences of Gypsy and Copia reference elements taken from Gypsy database version 2 (Llorens et al., 2011) at 40% identity, and selected one representative sequence per cluster (n=35). We then used the resulting RT library to search for homologous regions in target genomes with tBLASTn from the NCBI BLAST+ package. All overlapping hits on each genome were merged, and the corresponding FASTA sequences within the expected size range (520-840 bp) were extracted (n=). To avoid spending unnecessary computational time in the following steps, RT sequences from each species were clustered at a threshold of 95% identity using MMseqs2 (Steinegger and Söding, 2017), and a maximum of 50 sequences per group were selected for downstream analysis. The genomic positions of the RT coding regions were extended by 5 kb upstream and downstream, and the corresponding sequences were extracted (n=). Extended hits were then clustered using MMseqs2 (with parameters -c 0.5 --max-seq-len 15000), and the groups containing at least 5 sequences were aligned with MAFFT (Katoh et al., 2002). A consensus sequence was then generated for each sequence alignment through the modules “msa2profile” (with parameters --match-mode 1 --match-ratio 0.5) and “profile2consensus”, resulting in 25,565 consensus sequences. To assess the fraction of the consensus sequences representing LTR retrotransposons, we compared each one to a library of reference amino acid RT sequences from Copia, Gypsy, DIRS, endogenous retroviruses, Caulimoviridae, and LINEs using BLASTx. The consensus sequences corresponding to LTR retrotransposons were identified as those whose best hit (highest bit score) was against an RT from Copia or Gypsy.
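The interval handling and the best-hit rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names, the tuple layouts, and the 1-based inclusive coordinates are assumptions made for the example, strand is ignored, and the intermediate 95% identity clustering step is skipped. The size range (520-840 bp), the 5 kb flank, and the Copia/Gypsy best-hit criterion are taken from the description.

```python
from collections import defaultdict

MIN_LEN, MAX_LEN = 520, 840   # expected RT size range in bp (from the description)
FLANK = 5_000                 # 5 kb extension on each side (from the description)


def merge_hits(hits):
    """Merge overlapping (contig, start, end) hits on each contig.

    Assumption: coordinates are 1-based, inclusive, with start <= end.
    """
    by_contig = defaultdict(list)
    for contig, start, end in hits:
        by_contig[contig].append((start, end))

    merged = []
    for contig, intervals in by_contig.items():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:              # overlaps the current interval
                cur_end = max(cur_end, end)
            else:                             # gap: close the current interval
                merged.append((contig, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((contig, cur_start, cur_end))
    return merged


def filter_and_extend(regions, contig_lengths):
    """Keep regions in the expected size range, then add 5 kb flanks.

    Flanks are clipped to the contig boundaries.
    """
    extended = []
    for contig, start, end in regions:
        if MIN_LEN <= end - start + 1 <= MAX_LEN:
            new_start = max(1, start - FLANK)
            new_end = min(contig_lengths[contig], end + FLANK)
            extended.append((contig, new_start, new_end))
    return extended


def is_ltr_by_best_hit(blastx_hits):
    """Flag each query whose highest-bit-score hit is a Copia or Gypsy RT.

    blastx_hits: iterable of (query, superfamily, bit_score) tuples
    (hypothetical layout; real BLASTx output would need parsing first).
    """
    best = {}
    for query, superfamily, bit_score in blastx_hits:
        if query not in best or bit_score > best[query][1]:
            best[query] = (superfamily, bit_score)
    return {q: sf in {"Copia", "Gypsy"} for q, (sf, _) in best.items()}


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    hits = [("chr1", 10_000, 10_400), ("chr1", 10_300, 10_700),
            ("chr1", 50_000, 50_100), ("chr2", 2_000, 2_600)]
    lengths = {"chr1": 100_000, "chr2": 6_000}
    print(filter_and_extend(merge_hits(hits), lengths))
    print(is_ltr_by_best_hit([("cons1", "Gypsy", 300.0), ("cons1", "LINE", 120.0)]))
```

In this toy input the two overlapping chr1 hits merge into one 701 bp region that passes the size filter, the 101 bp hit is discarded, and the chr2 region has its flanks clipped at the contig ends.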
Provider:
figshare
Created:
2022-04-14



