
InstaDeepAI/CoVUniBind

Source: Hugging Face. Updated 2025-12-02; indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/InstaDeepAI/CoVUniBind
Resource description:
---
tags:
- antibody
- antigen
- binding
- COVID-19
- SARS-CoV-2
- benchmark
pretty_name: CoV-UniBind
---

<div align="center">
  <h1>CoV-UniBind - Coronavirus Unified Binding Database</h1>
</div>

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/681b661257cfafb5ead152d1/WvLoDuemHIwY8bQfOMahb.png" alt="Description" width="300"/>
</p>

**CoV-UniBind** curates and integrates structural and biochemical data on antibodies elicited by SARS-CoV-2 and related coronaviruses. It links 3D antibody structures to their binding properties across viral variants and includes epitope and sequence information. The dataset is a resource for analysing antibody-antigen interactions in the context of COVID-19 and provides a foundation for binding classification and regression tasks with deep learning models.

## Database Overview

| Dataset | Description | Label | Type | References (DOI) |
|---|---|---|---|---|
| covabdab | Binding labels from the Coronavirus Antibody Database (CoV-AbDab) | binding | bool | 10.1093/bioinformatics/btaa739 |
| dms_bloom | Deep Mutational Scanning escape data from Greaney et al. 2022 | escape | float | 10.1038/s41564-021-00972-2; 10.1016/j.chom.2020.11.007; 10.1038/s41467-021-24435-8; 10.1016/j.xcrm.2021.100255; 10.1126/science.abf9302; 10.1038/s41586-021-04385-3; 10.1038/s41586-022-04980-y; 10.1038/s41586-021-03817-4; 10.1038/s41586-021-03807-6 |
| dms_cao | Deep Mutational Scanning escape data from Cao et al. 2023 | escape | float | 10.1038/s41586-022-05644-7 |
| jian_elisa | ELISA antibody IC50 measurements from Jian et al. 2025 | IC50 | float | 10.1038/s41586-024-08315-x |
| spr | Surface Plasmon Resonance antibody affinity measurements from multiple sources | KD | float | 10.1038/s41467-024-54916-5; 10.1038/s41586-020-2380-z; 10.1038/s41467-021-24514-w; 10.1016/j.immuni.2022.06.005; 10.1016/j.xcrm.2023.100991 |
| drdb | Neutralisation potency data from the SARS-CoV-2 Resistance Database (DRDB) | IC50 | float | 10.1371/journal.pone.0261045 |

## Database Structure

```
.
├── antibody_info
│   └── antibody_synonyms.csv
├── data
│   ├── covabdab_binding.parquet
│   │   └── structures
│   │       ├── processed.zip
│   │       └── trimmed.zip
│   ├── dms_bloom_ab_escape.parquet
│   ├── dms_cao_ab_escape.parquet
│   ├── drdb_binding_potency.parquet
│   ├── jian_elisa_ab_ic50.parquet
│   └── spr_ab_affinity.parquet
├── scores
│   ├── covabdab_binding_scores.parquet
│   ├── dms_bloom_ab_escape_scores.parquet
│   ├── dms_cao_ab_escape_scores.parquet
│   ├── drdb_binding_potency_scores.parquet
│   ├── jian_elisa_ab_ic50_scores.parquet
│   └── spr_ab_affinity_scores.parquet
├── cov-unibind.py
└── README.md
```

## Usage Guide

```python
from datasets import load_dataset

# Choose a dataset name from the Database Overview table above.
data = "drdb"
dataset = load_dataset("InstaDeepAI/cov-unibind", name=data)
```

## Dataset Schema

The table below describes the columns in each dataset.

| Column Name | Description | Type | Nullable | Example |
|---|---|---|---|---|
| `antibody_name` | Name of the antibody | *str* | False | `bd30_515;bd_515` |
| `antigen_lineage` | Antigen lineage | *str* | False | `BA.1` |
| `target_value` | Experimental binding value | *float* or *bool* | False | `-2.327902` or `True` |
| `target_type` | Type of target value | *str* | False | `IC50_log10_fold` |
| `source_name` | Source of the data | *str* | False | `jian_2024_nature` |
| `source_doi` | DOI of the source | *str* | False | `10.1038/s41586-024-08315-x` |
| `assay_name` | Name of the assay | *str* | False | `elisa` |
| `pdb_id` | PDB structure ID | *str* | False | `7e88` |
| `structure_release_date` | Release date of the structure | *str* | False | `03/01/21` |
| `structure_resolution` | Resolution of the structure (Å) | *float* | False | `3.14` |
| `mutations` | Lineage consensus mutations | *str* | False | `A67V H69- V70- T95I...` |
| `antigen_chain_ids` | Chain IDs of the antigen | *str* | False | `C` |
| `antigen_domain` | Domain of the antigen | *str* | False | `RBD` |
| `antigen_residue_indices` | Residue indices of the antigen | *str* | False | `(13, 568)` |
| `antigen_residue_indices_trimmed` | Antigen residue indices, trimmed | *float* | True | `(333, 526)` |
| `antigen_host` | Host organism of the antigen | *str* | False | `severe acute respiratory syndrome coronavirus 2 (2697049)` |
| `antibody_heavy_chain_id` | Heavy chain ID of the antibody | *str* | False | `C` |
| `antibody_light_chain_id` | Light chain ID of the antibody | *str* | False | `B` |
| `epitope_residues` | Residues of the epitope | *str* | False | `R403 D405 T415` |
| `epitope_mutations` | PDB antigen mutations in the epitope | *str* | True | `D405T` |
| `epitope_domain` | Spike domain where the antibody binds | *str* | False | `RBD` |
| `epitope_alteration_count` | Number of alterations in the epitope | *float* | True | `2` |
| `spike_sequence` | Full spike protein sequence | *str* | False | `MFVFLVLLPLVSSQCVNLTTRTQL...` |
| `antibody_heavy_chain_sequence` | Sequence of the antibody heavy chain | *str* | False | `EVQLVESGGGLVQPGGSLRLSCAASEFIVSRNYMSWVR...` |
| `antibody_light_chain_sequence` | Sequence of the antibody light chain | *str* | False | `DIQMTQSPSSLSASVGDRVTITCQASQDINKYLNWYQQK...` |
| `antibody_vh_sequence` | Sequence of the VH domain | *str* | False | `EVQLVESGGGLVQPGGSLRLSCAASEFIVSRNYMSWVRQ...` |
| `antibody_vl_sequence` | Sequence of the VL domain | *str* | False | `DIQMTQSPSSLSASVGDRVTITCQASQDINKYLNW...` |
| `antigen_sequence` | Full antigen sequence | *str* | False | `TNLCPFDEVF-NATRFASVYAWNR--KRISNCVADYSVLYNLAPFFTFKCYGVSP...` |
| `antigen_sequence_trimmed` | Trimmed antigen sequence | *float* | True | `TRFASV-YAWNRKRISNCVADYSVLYNLAPFFT-FKCYGVSP...` |
| `antigen_sequence_without_indels` | Antigen sequence without insertions/deletions | *str* | False | `TNLCPFDEVFNATRFASVYAWNRKRISNCVAD...` |
| `antigen_sequence_trimmed_without_indels` | Trimmed antigen sequence without insertions/deletions | *float* | True | `NATRFASVYAWNRKRISNCVAD...` |
| `antigen_pdb_sequence` | Antigen sequence from PDB | *str* | False | `TNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSA...` |
| `antigen_pdb_sequence_trimmed` | Trimmed antigen sequence from PDB | *float* | True | `NATRFASVYAWNRKRISNCVADYSVLYNSA...` |

## Acknowledgments

This project makes use of the publicly available antibody datasets listed above. We acknowledge the contributions of the teams responsible for compiling and maintaining these valuable resources.
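
Several schema columns store structured values as plain strings (`antibody_name`, `mutations`, `epitope_residues`, `antigen_residue_indices`). The sketch below shows one way to parse them. The formats are assumptions inferred only from the single example values in the schema table, and the helper names are hypothetical; real rows may contain tokens that need extra handling.

```python
import ast
import re

# Assumed token shape: <reference residue><position><new residue or "-" for a deletion>,
# as in "A67V" or "H69-". Inferred from the schema example, not from a format spec.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z-])$")


def parse_antibody_names(value):
    """Split a ';'-separated antibody_name field into individual names."""
    return [name.strip() for name in value.split(";") if name.strip()]


def parse_mutations(value):
    """Turn 'A67V H69- V70-' into (reference, position, alternate) tuples."""
    parsed = []
    for token in value.split():
        match = MUTATION_RE.match(token)
        if match is None:
            # Skip tokens that do not fit the assumed pattern.
            continue
        ref, pos, alt = match.groups()
        parsed.append((ref, int(pos), alt))
    return parsed


def parse_epitope_residues(value):
    """Turn 'R403 D405 T415' into (residue, position) tuples."""
    return [(token[0], int(token[1:])) for token in value.split()]


def parse_residue_range(value):
    """Turn the string '(13, 568)' into a (start, end) tuple of ints."""
    start, end = ast.literal_eval(value)
    return int(start), int(end)
```

Applied to the example values from the schema table, `parse_antibody_names("bd30_515;bd_515")` yields the two synonyms as separate strings, and `parse_residue_range("(13, 568)")` yields a pair of integers that can be compared against epitope positions.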

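The Database Overview table and the Database Structure listing can be combined into a small lookup that reports, for each dataset name, its label, type, and the files that appear to belong to it. This is a sketch under assumptions: the pairing of names to file stems is inferred from the file names alone, the classification/regression split simply follows the label type, and `describe_dataset` is a hypothetical helper, not part of the repository.

```python
# Label and type per dataset name, copied from the Database Overview table.
LABELS = {
    "covabdab": ("binding", "bool"),
    "dms_bloom": ("escape", "float"),
    "dms_cao": ("escape", "float"),
    "jian_elisa": ("IC50", "float"),
    "spr": ("KD", "float"),
    "drdb": ("IC50", "float"),
}

# File stems as they appear in the Database Structure listing.
# Assumption: each dataset name maps to the file sharing its prefix.
FILE_STEMS = {
    "covabdab": "covabdab_binding",
    "dms_bloom": "dms_bloom_ab_escape",
    "dms_cao": "dms_cao_ab_escape",
    "jian_elisa": "jian_elisa_ab_ic50",
    "spr": "spr_ab_affinity",
    "drdb": "drdb_binding_potency",
}


def describe_dataset(name):
    """Return label, type, task, and assumed file paths for a dataset name."""
    if name not in LABELS:
        raise ValueError(f"unknown dataset {name!r}; expected one of {sorted(LABELS)}")
    label, dtype = LABELS[name]
    stem = FILE_STEMS[name]
    return {
        "label": label,
        "type": dtype,
        # Boolean labels suit classification; float labels suit regression.
        "task": "classification" if dtype == "bool" else "regression",
        "data_file": f"data/{stem}.parquet",
        "scores_file": f"scores/{stem}_scores.parquet",
    }
```

Validating the name up front means an unknown dataset name fails with a clear `ValueError` before any download is attempted.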
Provider:
InstaDeepAI