Sequence variation in HIV-1 protease and reverse transcriptase genes.

NIAID Data Ecosystem, indexed 2026-03-08

Download link:

https://figshare.com/articles/dataset/Sequence_variation_in_HIV_1_protease_and_reverse_transcriptase_genes_/1035070

Resource description:

Sequence variation at the nucleotide, dinucleotide, codon and amino acid level was assessed in an alignment of 4455 HIV-1 sequences spanning 1017 nucleotides from codon 1 of PR to codon 240 of RT, and representing three major subtypes of the HIV-1 M group (A, B and C). Constraint and polymorphism at each of these levels amongst wild-type (i.e. not drug-selected) sequences were evaluated separately for each subtype. Variation was assessed by scoring the frequency of mutations (defined as changes relative to the subtype majority consensus sequence) at all sites across the alignment. Plotting the distribution of nucleotide and amino acid variability reveals remarkably pervasive constraint at both levels. At the amino acid level, marked conservation reflects strong purifying selection acting on the PR and RT enzymes. Conservation at the amino acid level is reflected at the nucleic acid level, as many nucleotide sites (~55%) occur at non-synonymous positions within conserved amino acid sites. However, synonymous positions are also relatively invariable. Patterns of sequence variation were further characterised by mapping the distribution of conserved and polymorphic sites across the aligned region for each subtype. Amino acid sites were categorised as either conserved (<1% variation), constrained (>1% but <10% variation), or polymorphic (>10% variation). Nucleotide sites were categorised as synonymous or non-synonymous based on the consensus amino acid sequence. Synonymous sites were then further categorised as follows: (i) constrained typical (<10% variation from consensus and representing preferred codon usage for HIV-1), (ii) constrained atypical (<10% variation and representing atypical codon usage for HIV-1), and (iii) polymorphic (>10% variation). Mapping revealed a broadly similar distribution of conserved and polymorphic amino acid sites across all subtypes.
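The scoring and categorisation of amino acid sites described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names and the toy alignment are hypothetical, gaps and ambiguous residues are not handled, and because the description leaves sites at exactly 1% or 10% variation unassigned, the sketch places them in the higher category as an assumption.

```python
from collections import Counter

def majority_consensus(seqs):
    """Return the most common symbol in each column of an alignment.

    `seqs` is a list of equal-length aligned strings for one subtype.
    """
    return [Counter(column).most_common(1)[0][0] for column in zip(*seqs)]

def variation_per_site(seqs, consensus):
    """Fraction of sequences that differ from the consensus at each column."""
    n = len(seqs)
    return [
        sum(symbol != cons for symbol in column) / n
        for column, cons in zip(zip(*seqs), consensus)
    ]

def categorise(freq):
    """Map a variation frequency to the three amino acid categories.

    Assumption: boundary values (exactly 1% or 10%) go to the higher
    category, since the description does not say where they belong.
    """
    if freq < 0.01:
        return "conserved"
    if freq < 0.10:
        return "constrained"
    return "polymorphic"

# Hypothetical toy alignment: 200 three-residue sequences.
toy_alignment = ["PQI"] * 150 + ["PQV"] * 40 + ["PKV"] * 10
toy_consensus = majority_consensus(toy_alignment)
toy_variation = variation_per_site(toy_alignment, toy_consensus)
toy_categories = [categorise(f) for f in toy_variation]
print(list(zip(toy_consensus, toy_variation, toy_categories)))
```

Running the same three functions separately on the sequences of each subtype would mirror the per-subtype evaluation described in the summary.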
At the nucleotide level, conservation at synonymous sites in the alignment largely reflects the pronounced G-A bias. However, a proportion of conserved synonymous nucleotide positions in each subtype (12-13%) do not exhibit typical biases. These conserved, ‘atypical’ synonymous sites are distributed throughout PR and RT, with concentrations occurring within some regions, such as the N-terminus of PR. Strikingly, the distribution of constrained atypical and/or polymorphic sites relative to constrained typical sites is well conserved.

In the figure, the location of the PR and RT genes within the pol coding domain of an integrated HIV-1 provirus is shown. The PR-RT coding region is described at several levels of detail. For each gene region, the uppermost level of blocks shows the consensus amino acid sequence of subtype B, with differences in subtypes A and C highlighted beneath. The correspondence of the amino acid sequence to the established domains of the encoded proteins is indicated. The middle level of coloured blocks shows the distribution of conserved and polymorphic amino acid sites in each subtype, according to the amino acid variation key. Lollipops attached to this level indicate the location of drug resistance-associated mutations. Mutations identified as showing significant associations with treatment in this analysis are shown as closed circles; other well-characterised resistance mutations are shown as open circles. The bottom level of smaller coloured blocks shows the distribution of conserved and polymorphic nucleotide sites across the amplified region according to the nucleic acid variation key, and as discussed in the text. Predicted stem-loop structures are indicated above this level, as arcs connecting sites consistently predicted as being involved in local base-pairing interactions. For clarity, only stem loops identified in two or more subtypes are shown, and these are referenced by numbers corresponding to their details as given in the associated table (available on FigShare). Amino acid coordinates relative to each protein are shown across the top, and nucleotide coordinates relative to the entire analysed region are shown across the bottom. The location of miscellaneous genomic features discussed in the text (poly-lysine and trans-frame regions, and the RT active site) is highlighted.
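The split of nucleotide sites into synonymous and non-synonymous positions mentioned in the summary can be sketched in Python as follows. The description does not state the exact rule that was used, so this sketch assumes one plausible reading: a position in a consensus codon counts as synonymous if at least one single-nucleotide substitution there leaves the encoded amino acid unchanged under the standard genetic code. The names `CODON_TABLE` and `synonymous_positions` are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def synonymous_positions(codon):
    """Flag each of the three codon positions as synonymous (True) or not.

    Assumed rule: a position is synonymous if at least one alternative
    base at that position encodes the same amino acid as the codon itself.
    """
    amino_acid = CODON_TABLE[codon]
    flags = []
    for pos in range(3):
        alternatives = [
            codon[:pos] + base + codon[pos + 1:]
            for base in BASES
            if base != codon[pos]
        ]
        flags.append(any(CODON_TABLE[alt] == amino_acid for alt in alternatives))
    return flags

# Example codons (DNA alphabet): proline, methionine, leucine.
for example in ("CCT", "ATG", "CTG"):
    print(example, CODON_TABLE[example], synonymous_positions(example))
```

Under this assumed rule, positions flagged True in consensus codons would be the candidates for the further 'constrained typical', 'constrained atypical' and 'polymorphic' labels; assigning those labels also needs the per-site variation frequencies and a codon usage reference, neither of which is modelled here.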

Created:

2014-05-23
