ethan0222/HIV_PI

Source: Hugging Face. Updated: 2026-04-30. Indexed: 2026-05-03.
Download link:
https://hf-mirror.com/datasets/ethan0222/HIV_PI
Resource description:
This dataset was derived from the Stanford HIV Genotype-Phenotype database and contains 1,733 HIV protease sequences. Approximately half of the sequences are resistant to at least one antiretroviral therapeutic (ART). The dataset documentation covers data instances, data fields, and the data creation process. The data fields are ID, sequence, fold, FPV, IDV, NFV, and SQV. The dataset was curated to train a model (HIV-BERT-PI) that predicts whether an HIV protease sequence will confer resistance to certain ART drugs.

Social impact: because HIV mutates readily, drug resistance is a common problem when treating people infected with HIV. Protease inhibitors are a class of drugs to which HIV is known to develop resistance through mutation. By collecting protease sequences known to be resistant to one or more drugs, this dataset supports computational analysis of protease resistance mutations.

Bias: the dataset is composed predominantly of subtype B sequences from North America and Europe, with only minor contributions from subtypes C, A, and D. Refinement with additional sequences is needed for a model to perform well on non-B sequences.
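To make the field list concrete, here is a minimal Python sketch of how the "resistant to at least one drug" share could be computed from records with the fields named above. It rests on assumptions not stated in the description: that FPV, IDV, NFV, and SQV hold per-drug true/false resistance labels and that fold is an integer cross-validation index. The toy records and the helper names `resistant_to_any` and `resistance_summary` are invented for illustration and are not part of the dataset.

```python
from typing import Any, Dict, Iterable, Mapping

# Drug label fields named in the dataset description.
DRUG_FIELDS = ("FPV", "IDV", "NFV", "SQV")


def resistant_to_any(record: Mapping[str, Any]) -> bool:
    """Return True if the record is flagged resistant to at least one drug.

    Assumption: each drug field holds a truthy/falsy resistance label.
    """
    return any(bool(record[drug]) for drug in DRUG_FIELDS)


def resistance_summary(records: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Count resistant records overall and per drug."""
    rows = list(records)
    total = len(rows)
    resistant_any = sum(resistant_to_any(row) for row in rows)
    per_drug = {
        drug: sum(bool(row[drug]) for row in rows) for drug in DRUG_FIELDS
    }
    fraction = resistant_any / total if total else 0.0
    return {
        "total": total,
        "resistant_any": resistant_any,
        "fraction_resistant_any": fraction,
        "per_drug": per_drug,
    }


# Hypothetical toy records: the IDs, sequences, and labels are invented
# placeholders, not rows taken from the real dataset.
TOY_RECORDS = [
    {"ID": "toy-1", "sequence": "PQITLWQRPL", "fold": 0,
     "FPV": False, "IDV": False, "NFV": False, "SQV": False},
    {"ID": "toy-2", "sequence": "PQITLWQRPI", "fold": 1,
     "FPV": True, "IDV": False, "NFV": True, "SQV": False},
    {"ID": "toy-3", "sequence": "PQITLWKRPL", "fold": 2,
     "FPV": False, "IDV": False, "NFV": True, "SQV": False},
    {"ID": "toy-4", "sequence": "PQVTLWQRPL", "fold": 3,
     "FPV": False, "IDV": False, "NFV": False, "SQV": False},
]

if __name__ == "__main__":
    print(resistance_summary(TOY_RECORDS))
```

With the real data, the same functions could be applied to rows obtained through the Hugging Face `datasets` library, for example via `load_dataset("ethan0222/HIV_PI")`, provided the repository loads with its default configuration and the label columns are encoded as assumed here.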
Provider:
ethan0222