InstaDeepAI/CoVUniBind
Source: Hugging Face. Updated 2025-12-02; indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/InstaDeepAI/CoVUniBind
Resource description:
---
tags:
- antibody
- antigen
- binding
- COVID-19
- SARS-CoV-2
- benchmark
pretty_name: CoV-UniBind
---
<div align="center">
<h1>CoV-UniBind - Coronavirus Unified Binding Database</h1>
</div>
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/681b661257cfafb5ead152d1/WvLoDuemHIwY8bQfOMahb.png" alt="Description" width="300"/>
</p>
**CoV-UniBind** curates and integrates structural and biochemical data on antibodies specifically elicited by SARS-CoV-2 and
related coronaviruses. It links 3D antibody structures to their binding properties across viral variants, incorporating
epitope and sequence information. This dataset serves as a comprehensive resource for analysing antibody-antigen
interactions in the context of COVID-19 and provides a foundation for binding classification and regression tasks using
deep learning models.
## Database Overview
| Dataset | Description | Label | Type | References (doi) |
|------|-------------|-------|-----| ------- |
| covabdab | Binding labels from the Coronavirus Antibody Database (CoV-AbDab) | binding | bool | 10.1093/bioinformatics/btaa739 |
| dms_bloom | Deep Mutational Scanning escape data from Greaney et al. 2022 | escape | float | 10.1038/s41564-021-00972-2; 10.1016/j.chom.2020.11.007; 10.1038/s41467-021-24435-8; 10.1016/j.xcrm.2021.100255; 10.1126/science.abf9302; 10.1038/s41586-021-04385-3; 10.1038/s41586-022-04980-y; 10.1038/s41586-021-03817-4; 10.1038/s41586-021-03807-6 |
| dms_cao | Deep Mutational Scanning escape data from Cao et al. 2023 | escape | float | 10.1038/s41586-022-05644-7 |
| jian_elisa | ELISA antibody IC50 measurements from Jian et al. 2025 | IC50 | float | 10.1038/s41586-024-08315-x |
| spr | Surface Plasmon Resonance antibody affinity measurements from multiple sources | KD | float | 10.1038/s41467-024-54916-5; 10.1038/s41586-020-2380-z; 10.1038/s41467-021-24514-w; 10.1016/j.immuni.2022.06.005; 10.1016/j.xcrm.2023.100991 |
| drdb | Neutralisation potency data from the SARS-CoV-2 Resistance Database (DRDB) | IC50 | float | 10.1371/journal.pone.0261045 |
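Because the Label and Type columns indicate whether a subset suits classification or regression, the table can be mirrored in a small lookup. The sketch below is illustrative only: `SUBSETS` and `task_for` are hypothetical names that are not part of the dataset's loading script, and mapping `bool` labels to classification and `float` labels to regression is an assumption drawn from the types listed above.

```python
# Hypothetical lookup mirroring the overview table: subset name -> (label, type).
SUBSETS = {
    "covabdab": ("binding", bool),
    "dms_bloom": ("escape", float),
    "dms_cao": ("escape", float),
    "jian_elisa": ("IC50", float),
    "spr": ("KD", float),
    "drdb": ("IC50", float),
}


def task_for(subset: str) -> str:
    """Return the task kind suggested by the label type (an assumption)."""
    _label, dtype = SUBSETS[subset]
    return "classification" if dtype is bool else "regression"


# List the suggested task for every subset in the table.
for name in SUBSETS:
    print(f"{name}: {task_for(name)}")
```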
## Database Structure
```
.
├── antibody_info
│   └── antibody_synonyms.csv
├── data
│   ├── covabdab_binding.parquet
│   │   └── structures
│   │       ├── processed.zip
│   │       └── trimmed.zip
│   ├── dms_bloom_ab_escape.parquet
│   ├── dms_cao_ab_escape.parquet
│   ├── drdb_binding_potency.parquet
│   ├── jian_elisa_ab_ic50.parquet
│   └── spr_ab_affinity.parquet
├── scores
│   ├── covabdab_binding_scores.parquet
│   ├── dms_bloom_ab_escape_scores.parquet
│   ├── dms_cao_ab_escape_scores.parquet
│   ├── drdb_binding_potency_scores.parquet
│   ├── jian_elisa_ab_ic50_scores.parquet
│   └── spr_ab_affinity_scores.parquet
├── cov-unibind.py
└── README.md
```
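For working with a local copy of the repository, the layout above suggests that each subset has one file under `data` and a matching `_scores` file under `scores`. The helper below is a minimal sketch of that pairing. `SUBSET_FILES` and `subset_paths` are hypothetical names, and the pairing of subset names to file stems is inferred from the file-name prefixes rather than confirmed by the loading script.

```python
from pathlib import PurePosixPath

# Assumed mapping from subset name to file stem, inferred from the tree above.
SUBSET_FILES = {
    "covabdab": "covabdab_binding",
    "dms_bloom": "dms_bloom_ab_escape",
    "dms_cao": "dms_cao_ab_escape",
    "drdb": "drdb_binding_potency",
    "jian_elisa": "jian_elisa_ab_ic50",
    "spr": "spr_ab_affinity",
}


def subset_paths(subset: str, root: str = ".") -> tuple[PurePosixPath, PurePosixPath]:
    """Return (data_file, scores_file) for a subset, relative to the repository root."""
    stem = SUBSET_FILES[subset]
    base = PurePosixPath(root)
    data_file = base / "data" / f"{stem}.parquet"
    scores_file = base / "scores" / f"{stem}_scores.parquet"
    return data_file, scores_file


data_file, scores_file = subset_paths("spr")
print(data_file)
print(scores_file)
```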
## Usage Guide
```python
from datasets import load_dataset

# Pick a subset name from the Dataset column of the overview table
data = "drdb"
dataset = load_dataset("InstaDeepAI/cov-unibind", name=data)
```
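Subset names are short and easy to mistype, so it can help to check the name before requesting a download. The following is a minimal sketch using only the standard library. `VALID_NAMES` is copied from the overview table, and `check_subset_name` is a hypothetical helper, not something the dataset or the `datasets` library provides.

```python
import difflib

# Subset names as listed in the overview table.
VALID_NAMES = ("covabdab", "dms_bloom", "dms_cao", "jian_elisa", "spr", "drdb")


def check_subset_name(name: str) -> str:
    """Return the name unchanged if valid, otherwise raise with a suggestion."""
    if name in VALID_NAMES:
        return name
    # Suggest the closest valid name, if any is similar enough.
    hint = difflib.get_close_matches(name, VALID_NAMES, n=1)
    suggestion = f" Did you mean {hint[0]!r}?" if hint else ""
    raise ValueError(f"Unknown subset {name!r}.{suggestion}")


print(check_subset_name("jian_elisa"))
```

The checked name can then be passed on, for example `load_dataset("InstaDeepAI/cov-unibind", name=check_subset_name(data))`.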
## Dataset Schema
The table below includes information about the columns contained in the datasets.
| Column Name | Description | Type | Nullable | Example |
|---|---|---|---|---|
| `antibody_name` | Name of the antibody | *str* | False | `bd30_515;bd_515` |
| `antigen_lineage` | Antigen lineage | *str* | False | `BA.1` |
| `target_value` | Experimental binding value| *float* or *bool* | False | `-2.327902` or `True` |
| `target_type` | Type of target value | *str* | False | `IC50_log10_fold` |
| `source_name` | Source of the data | *str* | False | `jian_2024_nature` |
| `source_doi` | DOI of the source | *str* | False | `10.1038/s41586-024-08315-x` |
| `assay_name` | Name of the assay | *str* | False | `elisa` |
| `pdb_id` | PDB structure ID | *str* | False | `7e88` |
| `structure_release_date` | Release date of the structure | *str* | False | `03/01/21` |
| `structure_resolution` | Resolution of the structure (Å) | *float* | False | `3.14` |
| `mutations` | Lineage consensus mutations | *str* | False | `A67V H69- V70- T95I...` |
| `antigen_chain_ids` | Chain IDs of the antigen | *str* | False | `C` |
| `antigen_domain` | Domain of the antigen | *str* | False | `RBD` |
| `antigen_residue_indices` | Residue indices of the antigen | *str* | False | `(13, 568)` |
| `antigen_residue_indices_trimmed` | Antigen residue indices, trimmed | *str* | True | `(333, 526)` |
| `antigen_host` | Host organism of the antigen | *str* | False | `severe acute respiratory syndrome coronavirus 2 (2697049)` |
| `antibody_heavy_chain_id` | Heavy chain ID of the antibody | *str* | False | `C` |
| `antibody_light_chain_id` | Light chain ID of the antibody | *str* | False | `B` |
| `epitope_residues` | Residues of the epitope | *str* | False | `R403 D405 T415` |
| `epitope_mutations` | PDB antigen mutations in the epitope | *str* | True | `D405T` |
| `epitope_domain` | Spike domain where the antibody binds | *str* | False | `RBD` |
| `epitope_alteration_count` | Number of alterations in the epitope | *float* | True | `2` |
| `spike_sequence` | Full spike protein sequence | *str* | False | `MFVFLVLLPLVSSQCVNLTTRTQL...` |
| `antibody_heavy_chain_sequence` | Sequence of the antibody heavy chain | *str* | False | `EVQLVESGGGLVQPGGSLRLSCAASEFIVSRNYMSWVR...` |
| `antibody_light_chain_sequence` | Sequence of the antibody light chain | *str* | False | `DIQMTQSPSSLSASVGDRVTITCQASQDINKYLNWYQQK...` |
| `antibody_vh_sequence` | Sequence of the VH domain | *str* | False | `EVQLVESGGGLVQPGGSLRLSCAASEFIVSRNYMSWVRQ...` |
| `antibody_vl_sequence` | Sequence of the VL domain | *str* | False | `DIQMTQSPSSLSASVGDRVTITCQASQDINKYLNW...` |
| `antigen_sequence` | Full antigen sequence | *str* | False | `TNLCPFDEVF-NATRFASVYAWNR--KRISNCVADYSVLYNLAPFFTFKCYGVSP...` |
| `antigen_sequence_trimmed` | Trimmed antigen sequence | *str* | True | `TRFASV-YAWNRKRISNCVADYSVLYNLAPFFT-FKCYGVSP...` |
| `antigen_sequence_without_indels` | Antigen sequence without insertions/deletions | *str* | False | `TNLCPFDEVFNATRFASVYAWNRKRISNCVAD...` |
| `antigen_sequence_trimmed_without_indels` | Trimmed antigen sequence without insertions/deletions | *str* | True | `NATRFASVYAWNRKRISNCVAD...` |
| `antigen_pdb_sequence` | Antigen sequence from PDB | *str* | False | `TNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSA...` |
| `antigen_pdb_sequence_trimmed` | Trimmed antigen sequence from PDB | *str* | True | `NATRFASVYAWNRKRISNCVADYSVLYNSA...` |
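Several columns pack structured values into strings, such as `mutations`, `epitope_mutations`, and `epitope_residues`. The sketch below shows one way to unpack them and find mutations that fall on epitope positions. The token formats are assumptions inferred from the example values in the table; other notations (for example insertions) are not handled. The helper names are hypothetical, and the code assumes both columns use the same residue numbering.

```python
import re

# Assumed token formats, inferred from the example values in the schema table:
#   mutation: wild-type residue, position, then mutant residue or "-" (deletion)
#   epitope residue: residue letter followed by its position
MUTATION_RE = re.compile(r"([A-Z])(\d+)([A-Z-])")
RESIDUE_RE = re.compile(r"([A-Z])(\d+)")


def parse_mutations(field: str) -> list[tuple[str, int, str]]:
    """Split a space-separated mutation string into (wild_type, position, mutant)."""
    parsed = []
    for token in field.split():
        match = MUTATION_RE.fullmatch(token)
        if match is None:
            # Notations outside the assumed format are rejected, not guessed.
            raise ValueError(f"Unrecognised mutation token: {token!r}")
        wild_type, position, mutant = match.groups()
        parsed.append((wild_type, int(position), mutant))
    return parsed


def epitope_positions(field: str) -> set[int]:
    """Return the set of residue positions named in an epitope string."""
    positions = set()
    for token in field.split():
        match = RESIDUE_RE.fullmatch(token)
        if match is None:
            raise ValueError(f"Unrecognised epitope token: {token!r}")
        positions.add(int(match.group(2)))
    return positions


def mutations_in_epitope(mutations: str, epitope_residues: str) -> list[tuple[str, int, str]]:
    """Keep only the mutations whose position is part of the epitope."""
    positions = epitope_positions(epitope_residues)
    return [m for m in parse_mutations(mutations) if m[1] in positions]


# Inputs taken from the example values in the table above.
print(parse_mutations("A67V H69- V70- T95I"))
print(mutations_in_epitope("D405T", "R403 D405 T415"))
```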
## Acknowledgments
This project makes use of publicly available antibody datasets listed above. We acknowledge the contributions by the teams
responsible for compiling and maintaining these valuable resources.
Provider: InstaDeepAI



