
hazemessam/aav

Source: Hugging Face | Updated: 2026-01-02 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/hazemessam/aav
Description:
---
license: afl-3.0
tags:
- biology
---

FLIP AAV Dataset Splits

The AAV (Adeno-Associated Virus) dataset in FLIP focuses on predicting the fitness/efficiency of AAV capsid variants for gene therapy.

Note: DROP any example that has split=nan when training. The reason for leaving them is to keep this dataset identical to the original one.

- low_vs_high
  - Training set: Sequences with fitness values equal to or below wild type
  - Test set: Sequences with fitness values above wild type
  - Purpose: Tests if models can extrapolate from low-performing sequences to predict high-performing ones
- one_vs_many
  - Training set: Sequences with exactly 1 mutation from wild type
  - Test set: Sequences with many mutations (more than 1)
  - Purpose: Tests generalization from single mutants to multi-mutant sequences
- two_vs_many
  - Training set: Sequences with ≤2 mutations from wild type
  - Test set: Sequences with more than 2 mutations
  - Purpose: Tests if models trained on low-mutation sequences can predict fitness of higher-mutation sequences
- seven_vs_many
  - Training set: Sequences with exactly 7 mutations from wild type
  - Test set: Sequences with a different number of mutations
  - Purpose: Tests generalization when training on a specific mutation count
- des_mut (Designed vs Mutant)
  - Training set: Designed sequences (rationally designed variants)
  - Test set: Random mutants
  - Purpose: Tests if models trained on designed sequences can predict fitness of random mutants
- mut_des (Mutant vs Designed)
  - Training set: Random mutants
  - Test set: Designed sequences
  - Purpose: The reverse - tests if models trained on random mutants can predict designed sequence fitness
- sampled
  - Random 80/20 train-test split
  - Used as a baseline comparison
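
The instruction to drop `split=nan` examples can be sketched in plain Python as follows. This is a minimal illustration, not the dataset's own loading code. It assumes each split (for example `one_vs_many`) is stored as a column whose values are labels such as `"train"` and `"test"`, with missing entries appearing as `None`, a float NaN, or the string `"nan"`. The column names `sequence` and `target`, the label values, and the toy rows are all hypothetical.

```python
import math


def is_missing(value):
    """Return True if a split label should be treated as missing (nan)."""
    if value is None:
        return True
    if isinstance(value, float):
        return math.isnan(value)
    # Assumption: a missing label may also have been serialized as the text "nan".
    return isinstance(value, str) and value.strip().lower() == "nan"


def partition(rows, split_column):
    """Drop rows with a missing label in `split_column`, then group the rest by label."""
    groups = {}
    for row in rows:
        label = row.get(split_column)
        if is_missing(label):
            continue  # the dataset card says to drop split=nan examples
        groups.setdefault(label, []).append(row)
    return groups


# Toy rows for illustration only; column names and labels are assumptions.
rows = [
    {"sequence": "AAA", "target": -1.2, "one_vs_many": "train"},
    {"sequence": "AAC", "target": 0.4, "one_vs_many": "test"},
    {"sequence": "ACC", "target": 0.9, "one_vs_many": float("nan")},
    {"sequence": "CCC", "target": -0.3, "one_vs_many": None},
]

parts = partition(rows, "one_vs_many")
print({label: len(members) for label, members in parts.items()})
# prints {'train': 1, 'test': 1}
```

Filtering per split column, rather than once for the whole table, matters if an example is unassigned in one split but assigned in another; whether that happens in this dataset is not stated in the card, so check the actual columns before relying on it.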

Provided by:
hazemessam