shannoncoelho/uniprot
Hugging Face | Updated 2026-04-14 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/shannoncoelho/uniprot
Resource description:
---
license: mit
---
# Dataset Description
## Dataset Summary
This dataset is a mirror of the UniProt/Swiss-Prot database. It contains the names and sequences of more than 500,000 proteins.
This dataset was parsed from the FASTA file at https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz.
Supported Tasks and Leaderboards: None
Languages: English
## Dataset Structure
### Data Instances
Data Fields: id, description, sequence
Data Splits: None
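As a rough illustration of the three fields, the sketch below models one record as a plain Python dictionary and checks its shape. Only the field names come from this card. The example values are invented placeholders, the helper `has_expected_fields` is hypothetical, and the exact content stored in `id` and `description` is an assumption, since the card does not specify it.

```python
# Field names are taken from the card; everything else is illustrative.
EXPECTED_FIELDS = ("id", "description", "sequence")


def has_expected_fields(record: dict) -> bool:
    """Return True if the record holds exactly the three string fields named in the card."""
    return set(record) == set(EXPECTED_FIELDS) and all(
        isinstance(record[name], str) for name in EXPECTED_FIELDS
    )


# Hypothetical record, not a real UniProt entry.
example = {
    "id": "sp|X00000|EXAMPLE_PLACEHOLDER",
    "description": "sp|X00000|EXAMPLE_PLACEHOLDER Hypothetical protein OS=Placeholder organism",
    "sequence": "MKTAYIAKQR",
}

print(has_expected_fields(example))  # True
print(len(example["sequence"]))      # 10
```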
## Dataset Creation
The dataset was downloaded and parsed into a `dataset` object and uploaded unchanged.
Initial Data Collection and Normalization: Dataset was downloaded and curated on 03/09/2022.
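The card does not include the parsing script, so the following is only a minimal sketch of how FASTA text could be turned into records with the three fields. It assumes that `id` is the first whitespace-delimited token of the header and that `description` is the full header line. That convention is a guess and may differ from what the curator used. The function names and the sample entries are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def _make_record(header: str, chunks: List[str]) -> Dict[str, str]:
    # Assumption: id = first token of the header, description = whole header line.
    parts = header.split(maxsplit=1)
    return {
        "id": parts[0] if parts else "",
        "description": header,
        "sequence": "".join(chunks),
    }


def parse_fasta(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per FASTA record, joining wrapped sequence lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


# Invented placeholder entries, not real UniProt records.
sample = """>sp|X00001|DEMO1_PLACE Placeholder protein one OS=Placeholder organism OX=0
MKTAYIAKQR
QISFVKSHFS
>sp|X00002|DEMO2_PLACE Placeholder protein two OS=Placeholder organism OX=0
MADEUP
"""

records = list(parse_fasta(sample.splitlines()))
print(len(records))            # 2
print(records[0]["sequence"])  # MKTAYIAKQRQISFVKSHFS
```

For the compressed source file, the lines could be read with `gzip.open(path, "rt")` and passed to `parse_fasta` in the same way.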
## Considerations for Using the Data
Social Impact of Dataset: Due to the tendency of HIV to mutate, drug resistance is a common issue when attempting to treat those infected with HIV.
Protease inhibitors are a class of drugs to which HIV is known to develop resistance via mutations.
Thus, by providing a collection of protease sequences known to be resistant to one or more drugs, this dataset provides a significant collection of data that could be utilized to perform computational analysis of protease resistance mutations.
Discussion of Biases: Due to the sampling nature of this database, it is predominantly composed of genes from "well studied" genomes. This may limit the breadth of the genes it contains.
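One way to gauge this skew is to tally records per source organism. The sketch below is illustrative only. It assumes the `description` field keeps the full UniProt FASTA header, in which the organism name appears as `OS=<name>` followed by `OX=<taxon id>`. The helper `organism_counts` and the demo headers are hypothetical.

```python
import re
from collections import Counter

# UniProt FASTA headers carry the organism name as "OS=<name>" before "OX=<taxon id>".
_OS_PATTERN = re.compile(r"OS=(.+?)\s+OX=")


def organism_counts(descriptions):
    """Count descriptions per organism; headers without an OS= tag go under 'unknown'."""
    counts = Counter()
    for text in descriptions:
        match = _OS_PATTERN.search(text)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Invented placeholder headers, not real UniProt entries.
demo = [
    "sp|X00001|DEMO1 Placeholder protein OS=Organism A OX=1 PE=1 SV=1",
    "sp|X00002|DEMO2 Placeholder protein OS=Organism A OX=1 PE=1 SV=1",
    "sp|X00003|DEMO3 Placeholder protein OS=Organism B OX=2 PE=1 SV=1",
]

print(organism_counts(demo).most_common())  # [('Organism A', 2), ('Organism B', 1)]
```

A long tail of organisms with very few records next to a handful with many would be consistent with the bias described above.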
## Additional Information
- Dataset Curators: Will Dampier
- Citation Information: TBA
License: MIT
Provider:
shannoncoelho



