
yanwang0129/aav2_capsid_viability

Hugging Face. Updated 2026-03-06; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/yanwang0129/aav2_capsid_viability
Resource description:
---
tags:
- biology
- protein-sequence
- aav
- regression
pretty_name: "AAV2 Capsid Viability Dataset"
size_categories:
- 100K<n<1M # 289,736 variants
source_datasets:
- custom # Derived from Bryant et al. 2021, Nature Biotechnology (https://www.nature.com/articles/s41587-020-00793-4)

# Dataset Configurations (assuming one file 'aav2_viability_processed.csv' loaded as 'train')
configs:
- config_name: default
  data_files:
  - split: train # Single file defaults to 'train' split
    path: aav2_processed.csv

# Optional but recommended: Dataset structure information
# You can often auto-generate this using: datasets-cli dummy-metadata --repo_id your_username/your_dataset_name
dataset_info:
  features:
  - name: variable_region_sequence
    dtype: string
  - name: source_partition
    dtype: string
  - name: viral_selection
    dtype: float32 # Or float64 if that's the precision used
  - name: vp1_sequence
    dtype: string
  - name: vp2_sequence
    dtype: string
  - name: vp3_sequence
    dtype: string
  - name: edit_distance_from_wt
    dtype: int32
  config_name: default
  splits:
  - name: train
    num_bytes: 568803013 # <-- FILL IN: Get size of your CSV file in bytes
    num_examples: 289805
  download_size: 568579900 # <-- FILL IN: Size of file(s) to download (usually same as dataset_size for single CSV)
  dataset_size: 568803013 # <-- FILL IN: Total size on disk (usually same as num_bytes for single CSV)
---

# AAV2 Capsid Viability Dataset

This dataset contains a preprocessed version of the Adeno-associated virus 2 (AAV2) capsid viability dataset from [Bryant et al. 2021](https://www.nature.com/articles/s41587-020-00793-4), including the full VP1, VP2, and VP3 sequences for each variant.

![](https://www.frontiersin.org/files/Articles/63580/fimmu-05-00009-HTML/image_m/fimmu-05-00009-g001.jpg)

## Description

This processed version of the dataset contains 289,805 variants with the following columns:

- `variable_region_sequence` (str): The unique amino acid sequence of the variable region for each AAV2 variant.
- `source_partition` (str): Metadata field indicating the source partition, i.e. the method used to generate the variant in the original paper.
- `viral_selection` (float): The viral selection of the variant. See [Viral Selection Values](#viral-selection-values) for more details.
- `vp1_sequence` (str): The VP1 sequence of the variant.
- `vp2_sequence` (str): The VP2 sequence of the variant.
- `vp3_sequence` (str): The VP3 sequence of the variant.
- `edit_distance_from_wt` (int): The edit distance from the wild-type variable region sequence.

### Preprocessing

- We removed 57 variants because they contained a premature stop codon (denoted as "*") in the variable region.
- We also removed every variant whose sequence appeared more than once in the dataset, because in all of these cases the duplicates carried multiple `viral_selection` values. This resulted in the removal of 7,108 variants.

### Viral Selection Values

The experiments conducted in the paper aimed to generate diverse AAV2 capsid variants far beyond natural diversity while maintaining viability, using machine learning guidance. Viability was assessed using a high-throughput assay measuring the ability of each variant sequence to assemble an integral capsid that packages the viral genome.

For each variant, the fitness or 'viral selection' was calculated as the log of the ratio of the variant's read counts in the output 'viral library' pool to the read counts in the input 'DNA library' pool:

$$
\text{Viral Selection for Variant } i = \log\left(\frac{n_{i,\text{viral}}}{n_{i,\text{DNA}}}\right)
$$

![Viral Selection Values](viral_selection_distribution.png)

![Edit Distance from Wild Type vs Viral Selection](edit_distance_from_wt_vs_viral_selection.png)

### VP1, VP2, and VP3 sequences

[The AAV2 capsid is a 60-unit multimer made up of a 5:5:50 ratio of three viral proteins: VP1, VP2, and VP3. Through alternative splicing and leaky ribosomal scanning, the same mRNA (mRNA9) can give rise to VP2 and VP3, depending on which start codon is used. VP1 is translated from a distinct transcript initiated by the P5 promoter.](https://viralzone.expasy.org/226)

We have provided the full VP1, VP2, and VP3 sequences for each variant in the `vp1_sequence`, `vp2_sequence`, and `vp3_sequence` columns respectively.

To recreate the full VP1, VP2, and VP3 sequences for each variant, we collected the following wildtype source sequences:

- VP1 Source Sequence: [YP_680426.1](https://www.ncbi.nlm.nih.gov/protein/YP_680426.1?report=fasta)
- VP2 Source Sequence: [YP_680427.1](https://www.ncbi.nlm.nih.gov/protein/YP_680427.1?report=fasta)
- VP3 Source Sequence: [YP_680428.1](https://www.ncbi.nlm.nih.gov/protein/YP_680428.1?report=fasta)

We then replaced the wildtype variable region (`DEEEIRTTNPVATEQYGSVSTNLQRGNR`) in each of these source sequences with the corresponding `variable_region_sequence` for each variant.

## Source Data

The sequences in this dataset are sourced from the publication *"Deep diversification of an AAV capsid protein by machine learning"* [(Bryant et al. 2021)](https://www.nature.com/articles/s41587-020-00793-4), which contains 293,574 experimentally characterized variants of a wild-type AAV2 capsid protein sequence.

The mutations introduced to create the variants focused on a 28-amino acid segment (residue positions 561-588 of the full VP1 AAV2 capsid protein sequence), which was targeted because this region is involved in a variety of structural and functional roles.

### Original Dataset Availability

GitHub containing the original dataset: [https://github.com/alibashir/aav](https://github.com/alibashir/aav)

Source Publication: [Bryant et al. 2021](https://www.nature.com/articles/s41587-020-00793-4)

# Citation

If you use this dataset in your research, please cite the original publication these variants are sourced from:

```bibtex
@article{bryant2021deep,
  title={Deep diversification of an AAV capsid protein by machine learning},
  author={Bryant, Drew H and Bashir, Ali and Sinai, Sam and Jain, Nina K and Ogden, Pierce J and Riley, Patrick F and Church, George M and Colwell, Lucy J and Kelsic, Eric D},
  journal={Nature Biotechnology},
  volume={39},
  number={6},
  pages={691--696},
  year={2021},
  publisher={Springer Science and Business Media LLC},
  doi={10.1038/s41587-020-00793-4}
}
```
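
For readers who want to reproduce the derived columns, the following is a minimal sketch of the three calculations this card describes: the viral selection log ratio, the variable-region substitution used to build the VP sequences, and the edit distance from wild type. It is not the code used to build the dataset. The function names are hypothetical, the natural log and the absence of count normalization are assumptions (the card does not state a log base), and Levenshtein distance is an assumed reading of "edit distance". The flanking residues in the usage lines are made up and are not a real capsid sequence.

```python
import math

# Wild-type variable region quoted in this card (28 residues, VP1 positions 561-588).
WT_VARIABLE_REGION = "DEEEIRTTNPVATEQYGSVSTNLQRGNR"


def viral_selection(n_viral: int, n_dna: int) -> float:
    """Log ratio of viral-library reads to DNA-library reads.

    Assumption: natural log on raw counts, with no pseudocount or
    normalization. Raises ValueError if n_viral is 0.
    """
    return math.log(n_viral / n_dna)


def substitute_variable_region(wt_protein: str, variant_region: str) -> str:
    """Swap the wild-type variable region of a VP sequence for a variant's."""
    if wt_protein.count(WT_VARIABLE_REGION) != 1:
        raise ValueError("expected exactly one wild-type variable region")
    return wt_protein.replace(WT_VARIABLE_REGION, variant_region)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed metric for edit_distance_from_wt)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


# Toy usage with made-up flanking residues, NOT a real capsid protein.
toy_wt = "MAAA" + WT_VARIABLE_REGION + "GGGK"
toy_variant_region = "DEEEIRTTNPVATEQYGSVSTNLQRGNK"  # hypothetical R->K at the last position
toy_variant = substitute_variable_region(toy_wt, toy_variant_region)
```

With the real data, `wt_protein` would be one of the three NCBI source sequences listed under "VP1, VP2, and VP3 sequences", and `variant_region` would come from the `variable_region_sequence` column. The single-occurrence check guards against silently replacing the wrong span if a source sequence does not contain the wild-type region exactly once.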
Provider:
yanwang0129