hazemessam/aav
Source: Hugging Face (updated 2026-01-02, indexed 2026-03-29)
Download link:
https://hf-mirror.com/datasets/hazemessam/aav
Dataset description:
---
license: afl-3.0
tags:
- biology
---
FLIP AAV Dataset Splits
The AAV (Adeno-Associated Virus) dataset in FLIP focuses on predicting the fitness/efficiency of AAV capsid variants for gene therapy.
Note: Drop every example whose split value is NaN before training. These rows are kept only so that the dataset stays identical to the original one.
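The filtering step in the note can be sketched as follows. This is a minimal illustration, not the dataset's official loading code: the field names `sequence`, `target`, and `split`, the label values `train` and `test`, and the helpers `is_missing` and `drop_unassigned` are all assumptions made for the example. How a missing split appears (a float NaN, `None`, or the string `nan`) depends on the loader, so the sketch checks for all three.

```python
import math


def is_missing(value):
    """Return True for None, a float NaN, or the literal string 'nan'."""
    if value is None:
        return True
    if isinstance(value, float):
        return math.isnan(value)
    return str(value).strip().lower() == "nan"


def drop_unassigned(rows, split_key="split"):
    """Keep only rows that carry a usable split label.

    `split_key` is a hypothetical column name; adjust it to the real schema.
    """
    return [row for row in rows if not is_missing(row.get(split_key))]


# Toy rows standing in for the real data (values are made up).
rows = [
    {"sequence": "AAA", "target": 0.1, "split": "train"},
    {"sequence": "AAC", "target": -0.3, "split": "test"},
    {"sequence": "ACC", "target": 1.2, "split": float("nan")},
    {"sequence": "CCC", "target": 0.5, "split": None},
]

kept = drop_unassigned(rows)
train_rows = [row for row in kept if row["split"] == "train"]
test_rows = [row for row in kept if row["split"] == "test"]
```

Filtering once and then selecting the train and test rows from the filtered list keeps unassigned examples out of both sets.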
- low_vs_high
  - Training set: Sequences with fitness values equal to or below wild type
  - Test set: Sequences with fitness values above wild type
  - Purpose: Tests if models can extrapolate from low-performing sequences to predict high-performing ones
- one_vs_many
  - Training set: Sequences with exactly 1 mutation from wild type
  - Test set: Sequences with many mutations (more than 1)
  - Purpose: Tests generalization from single mutants to multi-mutant sequences
- two_vs_many
  - Training set: Sequences with ≤2 mutations from wild type
  - Test set: Sequences with more than 2 mutations
  - Purpose: Tests if models trained on low-mutation sequences can predict fitness of higher-mutation sequences
- seven_vs_many
  - Training set: Sequences with exactly 7 mutations from wild type
  - Test set: Sequences with a different number of mutations
  - Purpose: Tests generalization when training on a specific mutation count
- des_mut (Designed vs Mutant)
  - Training set: Designed sequences (rationally designed variants)
  - Test set: Random mutants
  - Purpose: Tests if models trained on designed sequences can predict fitness of random mutants
- mut_des (Mutant vs Designed)
  - Training set: Random mutants
  - Test set: Designed sequences
  - Purpose: The reverse: tests if models trained on random mutants can predict designed sequence fitness
- sampled
  - Random 80/20 train-test split
  - Used as a baseline comparison
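To make a few of the split rules above concrete, the sketch below restates low_vs_high, one_vs_many, two_vs_many, and sampled as small functions. It only illustrates the rules as this card describes them and is not the code that produced the published splits. All function names are hypothetical, and counting mutations as a position-wise difference assumes substitution-only variants of the same length as the wild type, which may not hold for every sequence in the dataset.

```python
import random


def count_substitutions(sequence, wild_type):
    """Count positions that differ from wild type (assumes equal lengths)."""
    if len(sequence) != len(wild_type):
        raise ValueError("this sketch only handles substitution-only variants")
    return sum(a != b for a, b in zip(sequence, wild_type))


def low_vs_high(fitness, wild_type_fitness):
    """Train on fitness at or below wild type, test on fitness above it."""
    return "train" if fitness <= wild_type_fitness else "test"


def one_vs_many(n_mutations):
    """Train on exactly 1 mutation, test on more than 1, otherwise unassigned."""
    if n_mutations == 1:
        return "train"
    if n_mutations > 1:
        return "test"
    return None


def two_vs_many(n_mutations):
    """Train on 2 or fewer mutations, test on more than 2."""
    return "train" if n_mutations <= 2 else "test"


def sampled(n_examples, train_fraction=0.8, seed=0):
    """Random split; the seed is an arbitrary choice for reproducibility."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_examples * train_fraction)
    labels = ["test"] * n_examples
    for index in indices[:n_train]:
        labels[index] = "train"
    return labels


# Toy usage with made-up sequences.
wild_type = "ACDEFG"
variant = "ACDKFW"
n_mut = count_substitutions(variant, wild_type)
labels = sampled(10)
```

In practice the precomputed split labels shipped with the dataset should be used as they are; the functions here only help in reading what each split is meant to test.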
Provider: hazemessam



