
shannoncoelho/uniprot

Hugging Face, updated 2026-04-14, indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/shannoncoelho/uniprot
Resource description:
---
license: mit
---

# Dataset Description

## Dataset Summary

This dataset is a mirror of the UniProt/Swiss-Prot database. It contains the names and sequences of more than 500,000 proteins. The dataset was parsed from the FASTA file at https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz.

Supported Tasks and Leaderboards: None

Languages: English

## Dataset Structure

### Data Instances

Data Fields: id, description, sequence

Data Splits: None

## Dataset Creation

The dataset was downloaded, parsed into a `dataset` object, and uploaded unchanged.

Initial Data Collection and Normalization: The dataset was downloaded and curated on 03/09/2022.

## Considerations for Using the Data

Social Impact of Dataset: Due to the tendency of HIV to mutate, drug resistance is a common issue when treating people infected with HIV. Protease inhibitors are a class of drugs to which HIV is known to develop resistance via mutations. Thus, by providing a collection of protease sequences known to be resistant to one or more drugs, this dataset offers a substantial body of data that could be used for computational analysis of protease resistance mutations.

Discussion of Biases: Due to the sampling nature of this database, it is predominantly composed of genes from "well studied" genomes. This may limit the breadth of the genes it contains.

## Additional Information

- Dataset Curators: Will Dampier
- Citation Information: TBA
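
The card says the records were parsed from a FASTA file into `id`, `description`, and `sequence` fields but does not say how. The following is a minimal sketch of one way such parsing could work, not the curator's actual code. It assumes that `id` is the first whitespace-delimited token of the header line and that `description` is the full header without the leading `>`; the card confirms neither. The function names and the two sample entries are hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable, Iterator, List


def _make_record(header: str, chunks: List[str]) -> Dict[str, str]:
    # Assumption (not confirmed by the dataset card): the id is the first
    # whitespace-delimited token of the header, and the description is the
    # whole header line without the leading ">".
    parts = header.split(None, 1)
    return {
        "id": parts[0] if parts else "",
        "description": header,
        "sequence": "".join(chunks),
    }


def parse_fasta(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one record per FASTA entry from an iterable of text lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines; collect them.
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


# Hypothetical sample entries, made up for this example (not real UniProt data).
example = """>sp|X00001|DEMO1_TEST Hypothetical demo protein OS=Test organism
MKTAYIAKQR
QISFVKSHFS
>sp|X00002|DEMO2_TEST Another hypothetical protein OS=Test organism
MADEEKLPPG
""".splitlines()

records = list(parse_fasta(example))
for record in records:
    print(record["id"], len(record["sequence"]))
```

To run the same function over the real file, the downloaded archive could be opened in text mode with `gzip.open(path, "rt")` and the file object passed straight to `parse_fasta`, since it iterates line by line.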

Provider: shannoncoelho