hazemessam/aav

Name: hazemessam/aav
Creator: hazemessam
Published: 2026-01-02 02:57:39
License: afl-3.0

Hugging Face · updated 2026-01-02 · indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/hazemessam/aav

Resource description:

---
license: afl-3.0
tags:
- biology
---

FLIP AAV Dataset Splits

The AAV (Adeno-Associated Virus) dataset in FLIP focuses on predicting the fitness/efficiency of AAV capsid variants for gene therapy.

Note: DROP any example that has split=nan when training. The reason for leaving them is to keep this dataset identical to the original one.

- low_vs_high
  - Training set: Sequences with fitness values equal to or below wild type
  - Test set: Sequences with fitness values above wild type
  - Purpose: Tests if models can extrapolate from low-performing sequences to predict high-performing ones
- one_vs_many
  - Training set: Sequences with exactly 1 mutation from wild type
  - Test set: Sequences with many mutations (more than 1)
  - Purpose: Tests generalization from single mutants to multi-mutant sequences
- two_vs_many
  - Training set: Sequences with ≤2 mutations from wild type
  - Test set: Sequences with more than 2 mutations
  - Purpose: Tests if models trained on low-mutation sequences can predict fitness of higher-mutation sequences
- seven_vs_many
  - Training set: Sequences with exactly 7 mutations from wild type
  - Test set: Sequences with a different number of mutations
  - Purpose: Tests generalization when training on a specific mutation count
- des_mut (Designed vs Mutant)
  - Training set: Designed sequences (rationally designed variants)
  - Test set: Random mutants
  - Purpose: Tests if models trained on designed sequences can predict fitness of random mutants
- mut_des (Mutant vs Designed)
  - Training set: Random mutants
  - Test set: Designed sequences
  - Purpose: The reverse - tests if models trained on random mutants can predict designed sequence fitness
- sampled
  - Random 80/20 train-test split
  - Used as a baseline comparison
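
The note about dropping examples with split=nan can be sketched in a few lines of plain Python. This is only an illustration on toy rows, not the dataset's actual schema: the field names `sequence`, `target`, and `split`, and the label values `train` and `test`, are assumptions and should be checked against the real files before use.

```python
import math


def is_missing(value):
    """Return True for None, a float NaN, or the literal string 'nan'."""
    if value is None:
        return True
    if isinstance(value, float):
        return math.isnan(value)
    return str(value).strip().lower() == "nan"


def select_split(rows, split_key="split"):
    """Drop rows whose split label is missing, then group the rest by label.

    `split_key` is a hypothetical field name; adjust it to the real column.
    """
    groups = {}
    for row in rows:
        label = row.get(split_key)
        if is_missing(label):
            continue  # unassigned example: excluded from training and testing
        groups.setdefault(label, []).append(row)
    return groups


# Toy rows standing in for real examples (made-up sequences and targets).
rows = [
    {"sequence": "AAA", "target": 0.1, "split": "train"},
    {"sequence": "AAC", "target": -0.3, "split": "test"},
    {"sequence": "ACC", "target": 1.2, "split": float("nan")},
    {"sequence": "CCC", "target": 0.7, "split": "train"},
    {"sequence": "CCA", "target": 0.2, "split": None},
]

groups = select_split(rows)
train_rows = groups.get("train", [])
test_rows = groups.get("test", [])
```

Treating `None`, NaN, and the string `nan` alike is a defensive choice, since a missing label may be represented differently depending on how the file is read.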

Provided by:

hazemessam