introvoyz041/agab-db

Name: introvoyz041/agab-db
Creator: introvoyz041
Published: 2026-04-09 10:27:15
License: not described on this listing (the dataset card below specifies non-commercial research use only)

Source: Hugging Face · Updated 2026-04-09 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/introvoyz041/agab-db


Description:

---
dataset_info:
  features:
  - name: dataset
    dtype: string
  - name: heavy_sequence
    dtype: string
  - name: light_sequence
    dtype: string
  - name: scfv
    dtype: bool
  - name: affinity_type
    dtype: string
  - name: affinity
    dtype: string
  - name: antigen_sequence
    dtype: string
  - name: confidence
    dtype:
      class_label:
        names:
          '0': medium
          '1': high
          '2': very_high
  - name: nanobody
    dtype: bool
  - name: processed_measurement
    dtype: float64
  - name: target_name
    dtype: string
  - name: target_pdb
    dtype: string
  - name: target_uniprot
    dtype: string
  - name: source_url
    dtype: string
  - name: heavy_cdr1
    dtype: string
  - name: heavy_cdr2
    dtype: string
  - name: heavy_cdr3
    dtype: string
  - name: light_cdr1
    dtype: string
  - name: light_cdr2
    dtype: string
  - name: light_cdr3
    dtype: string
  splits:
  - name: train
    num_bytes: 2137958513
    num_examples: 1227083
  download_size: 339997839
  dataset_size: 2137958513
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
pretty_name: 'AgAb DB: Antigen Specific Antibody Database'
tags:
- biology
- immunology
- antibodies
- protein-protein-interactions
- drug-discovery
- computational-biology
- therapeutics
- machine-learning
- protein-sequence-modeling
- binding-affinity-prediction
- antibody-design
task_categories:
- text-classification
license: other
license_details: "Non-commercial research use only. Commercial inquiries should be directed to NaturalAntibody."
language:
- en
---

# AgAb DB: Antigen Specific Antibody Database

A comprehensive collection of antibody-antigen interaction data for computational biology and therapeutic design.

## Dataset Summary

AgAb DB aggregates antibody-antigen binding data from multiple sources, containing over 1.2 million antibody-antigen pairs with binding affinity measurements. This dataset is essential for training machine learning models in computational immunology and antibody engineering.
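The metadata above declares `confidence` as a class label, so each record stores an integer id (0, 1, or 2) rather than the name. The sketch below shows one way to translate between the two in plain Python. The helper names and the sample rows are hypothetical and only illustrate the mapping; when the data is loaded with the `datasets` library, the `ClassLabel` feature offers the same conversion through `int2str` and `str2int`.

```python
# Label order copied from the `confidence` class_label names in the metadata.
CONFIDENCE_NAMES = ["medium", "high", "very_high"]


def confidence_to_id(name: str) -> int:
    """Return the integer id stored in the dataset for a confidence name."""
    return CONFIDENCE_NAMES.index(name)


def confidence_to_name(label_id: int) -> str:
    """Return the readable confidence name for a stored integer id."""
    return CONFIDENCE_NAMES[label_id]


# Hypothetical rows for illustration only (not real records from the dataset).
rows = [
    {"target_name": "example-a", "confidence": 2},
    {"target_name": "example-b", "confidence": 0},
]

# Compare against the integer id, not the string name.
very_high = confidence_to_id("very_high")
kept = [row for row in rows if row["confidence"] == very_high]
```

Comparing the stored value against the string `"very_high"` would match nothing, which is why the lookup goes through the id.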
## Key Statistics

- **1,227,083** antibody-antigen interaction records
- **309,884** unique antibodies (full antibodies, nanobodies, scFvs)
- **4,334** unique antigens
- **170,660** complete heavy/light chain pairs
- **70,388** nanobodies and **132,157** scFv antibodies
- **Focus on human health**: Infectious diseases, cancer, autoimmune conditions
- **Diverse antigen types**: Viral proteins, bacterial antigens, cancer markers, autoantigens

*Note: Statistics for unique antibodies/antigens are from original documentation and may be proportionally larger in the full 1.2M record dataset.*

### Data Quality Distribution

- **51% very_high confidence** (robust sequences and methodology)
- **high confidence** (manually curated datasets)
- **medium confidence** (automated discovery, some uncertainty)

### Affinity Measurement Types

- Quantitative metrics: Gibbs free energy changes, kinetic constants, IC₅₀
- Qualitative binding assessments
- Mixed data types across different sources

## Data Structure

### Core Fields

| Field | Type | Description |
|-------|------|-------------|
| `heavy_sequence` | string | Antibody heavy chain amino acid sequence |
| `light_sequence` | string | Antibody light chain amino acid sequence |
| `antigen_sequence` | string | Target antigen amino acid sequence |
| `affinity` | string | Binding affinity value |
| `confidence` | class label | Data quality level (very_high, high, medium), stored as an integer id |

### Additional Metadata

| Field | Type | Description |
|-------|------|-------------|
| `dataset` | string | Original source dataset |
| `affinity_type` | string | Measurement type (KD, IC₅₀, etc.) |
| `nanobody` | bool | Whether it's a nanobody |
| `scfv` | bool | Single-chain variable fragment |
| `target_name` | string | Antigen name |
| `target_pdb` | string | PDB structure ID |
| `target_uniprot` | string | UniProt accession |
| `heavy_cdr1/cdr2/cdr3` | string | Heavy chain complementarity-determining regions (CDRs) |
| `light_cdr1/cdr2/cdr3` | string | Light chain CDRs |

## Dataset Split

- **Train**: All 1,227,083 records in a single training set

The full dataset is provided as a single training split to maximize available data for machine learning applications. Users can create their own validation/test splits as needed for their specific use cases.

### Confidence Categories

- **very_high**: Both sequences and methodology used for calculating affinity were robust (e.g., AbDesign, BioMap, SKEMPI 2.0)
- **high**: Manually curated datasets or those containing antigen names/mutations rather than full sequences (e.g., FLAB datasets)
- **medium**: Automated data discovery with some uncertainty (e.g., patent databases)

### Antibody Types Included

- **Full antibodies**: Complete heavy and light chain pairs (traditional monoclonal antibodies)
- **Nanobodies**: Single-domain antibodies (VHH format) - 70K+ entries across datasets
- **scFv**: Single-chain variable fragments - 132K+ entries, primarily from AlphaSeq
- **Mixed formats**: Various antibody fragment types and engineered variants

### Nanobody Distribution by Source

| Source | Nanobody Count | Notes |
|--------|----------------|-------|
| AlphaSeq | 67,058 | Mutations for improved binding |
| Patents | 40,517 | Patent literature extraction |
| Literature | 1,936 | Research paper curation |
| Structures | 1,258 | PDB structure-derived |
| AATP, OSH, RMNA | ~133 | Specialized datasets |

### scFv Distribution by Source

| Source | scFv Count | Notes |
|--------|------------|-------|
| AlphaSeq | 131,645 | Primary scFv source |
| Literature | 512 | Research paper curation |

### Sequence Characteristics

- **Predominantly short sequences**: <150 amino acids typical
- **Majority include both chains**: Heavy and light chain pairs
- **Diverse antigen targets**: Infectious diseases, cancer, autoimmune conditions
- **Multiple affinity measurement types**: KD, IC₅₀, ΔG, binary binding

## Usage

### Load the Dataset

```python
from datasets import load_dataset

# Load from OpenMed
dataset = load_dataset("OpenMed/agab-db")

# Access the training data (full dataset)
train_data = dataset["train"]

# Optional: create your own validation/test splits
from sklearn.model_selection import train_test_split

# Convert to pandas for splitting
df = train_data.to_pandas()
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=42)
```

### Filter for Research

```python
# High-quality data only.
# `confidence` is a class label stored as an integer, so look up its id first.
very_high = dataset["train"].features["confidence"].str2int("very_high")
high_quality = dataset.filter(lambda x: x["confidence"] == very_high)

# Nanobodies for specialized studies
nanobodies = dataset.filter(lambda x: x["nanobody"])

# Specific antigens (guard against a missing target_name)
covid_data = dataset.filter(lambda x: "covid" in (x["target_name"] or "").lower())
```

### Prepare for ML Training

```python
# Extract sequences for language models
sequences = []
for item in dataset["train"]:
    if item["heavy_sequence"]:
        sequences.append(item["heavy_sequence"])
    if item["light_sequence"]:
        sequences.append(item["light_sequence"])
```

## Applications

### Machine Learning Use Cases

- **Antibody language models**: Train sequence models on antibody repertoires for generative design
- **Binding affinity prediction**: Develop regression models for antibody-antigen interaction strength
- **Therapeutic design**: Guide rational antibody engineering and optimization
- **Computational immunology**: Study immune responses and antibody development patterns
- **Virtual screening**: Prioritize antibody candidates for experimental validation
- **Structure-affinity relationships**: Learn connections between 3D structures and binding properties

### Research Applications

- **Antibody repertoire analysis**: Study natural antibody diversity and evolution
- **Cross-reactivity prediction**: Identify potential off-target effects
- **Immunogenicity assessment**: Predict antibody developability and safety
- **Drug discovery pipelines**: Accelerate hit identification and lead optimization
- **Comparative immunology**: Study antibody responses across different species

### Integration with Other Tools

- **Protein structure prediction**: Use with ESMFold for 3D structure generation
- **Molecular dynamics**: Combine with simulation tools for binding mechanism studies
- **High-throughput screening**: Guide experimental antibody library screening
- **CRISPR engineering**: Design antibodies for gene therapy applications

## Data Sources

Aggregated from 25+ datasets including GenBank, SKEMPI 2.0, peer-reviewed publications, and patent databases.

### Major Dataset Components

| Dataset | Records | Unique Antibodies | Key Characteristics |
|---------|---------|-------------------|---------------------|
| **BUZZ** | 524,346 | 524,346 | Trastuzumab mutations binding to HER2 |
| **AlphaSeq** | 198,703 | 193,867 | Antibody mutations across 4 targets (TIGIT, SARS-CoV2-RBD, PD-1, HER2) |
| **ABBD** | 155,853 | 88,946 | Eight antibody-antigen cases with heavy chain mutations |
| **Patents** | 217,463 | 31,173 | NLP-extracted sequences from patent literature |
| **COVID-19** | 27,301 | 6,759 | SARS-CoV-2 neutralization data (Cov-AbDab) |
| **HIV** | 48,008 | 192 | HIV-targeting antibodies (LANL database) |
| **BioMap** | 2,725 | 728 | Binding ΔG values across 8 species |
| **Literature** | 5,580 | 4,841 | Curated from research articles (1,940 nanobodies) |
| **FLAB** | 6,849 | 6,798 | Five publications on viral/cancer targets |
| **ABDesign** | 672 | 672 | Systematic CDR-H3 point mutations |

### Inclusion Criteria

- Transparency and completeness of data
- Relevance to human health
- Quantitative binding affinity measurements
- Complete amino acid sequences for all biomolecules

### Data Processing Pipeline

1. **Aggregation**: Collection from 14 distinct sources → 25 integrated datasets
2. **Curation**: Multi-stage pipeline with automated extraction, normalization, and manual verification
3. **Standardization**: Common structure implemented across all studies
4. **Validation**: Automated feasibility checks and manual verification of critical datasets

## Citation

```bibtex
@dataset{agab_db,
  title={AgAb DB: Antigen Specific Antibody Database},
  author={NaturalAntibody},
  year={2024},
  url={https://naturalantibody.com/agab/}
}
```

## License

Available for non-commercial research use only. Contact NaturalAntibody for commercial licensing.

---

*Dataset provided by [NaturalAntibody](https://naturalantibody.com/agab/)*
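Since the card leaves validation/test splitting to the user, and several components repeat the same antibody many times (the HIV component lists 48,008 records for 192 unique antibodies), a purely random row split may place the same antibody in both training and test data. The sketch below is one possible alternative, not part of the original card: it assigns each record to a split by hashing its heavy and light chain sequences, so identical antibodies always land together. The function name `assign_split` and the default fractions are assumptions made for illustration.

```python
import hashlib


def assign_split(heavy, light, val_frac=0.1, test_frac=0.1):
    """Map an antibody (heavy + light sequence) to a deterministic split name.

    Records that share both sequences always receive the same split.
    Missing chains (None or empty, e.g. a nanobody without a light chain)
    are treated as empty strings.
    """
    key = f"{heavy or ''}|{light or ''}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "validation"
    return "train"


# Hypothetical short sequences, for illustration only.
split_a = assign_split("EVQLVESGG", "DIQMTQSPS")
split_b = assign_split("EVQLVESGG", "DIQMTQSPS")
```

Grouping by exact sequence does not separate near-identical mutants of one parent antibody (as in the mutation-scan components), so a stricter scheme, such as grouping by the `dataset` field or by sequence similarity, may be preferable depending on the task.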

### Dataset Metadata

#### Feature Fields

| Field | Data type | Description |
|-------|-----------|-------------|
| `dataset` | string | Original source dataset |
| `heavy_sequence` | string | Antibody heavy chain amino acid sequence |
| `light_sequence` | string | Antibody light chain amino acid sequence |
| `scfv` | bool | Whether the entry is a single-chain variable fragment (scFv) |
| `affinity_type` | string | Affinity measurement type |
| `affinity` | string | Binding affinity value |
| `antigen_sequence` | string | Target antigen amino acid sequence |
| `confidence` | class label | Data confidence: 0 = medium, 1 = high, 2 = very_high |
| `nanobody` | bool | Whether the entry is a nanobody |
| `processed_measurement` | float64 | Processed measurement value |
| `target_name` | string | Antigen name |
| `target_pdb` | string | Protein Data Bank (PDB) structure ID |
| `target_uniprot` | string | Universal Protein Resource (UniProt) accession |
| `source_url` | string | Link to the original data source |
| `heavy_cdr1`/`heavy_cdr2`/`heavy_cdr3` | string | Heavy chain complementarity-determining region (CDR) 1/2/3 sequences |
| `light_cdr1`/`light_cdr2`/`light_cdr3` | string | Light chain CDR 1/2/3 sequences |

#### Dataset Splits

| Split | Size (bytes) | Examples |
|-------|--------------|----------|
| `train` | 2137958513 | 1227083 |

#### Data Size

- Download size: 339997839 bytes
- Total dataset size: 2137958513 bytes

#### Configuration

- Default config: `default`
- Data file path: `data/train-*` (the train split)

#### Identification and Tags

- Pretty name: `AgAb DB: Antigen Specific Antibody Database`
- Tags: biology, immunology, antibodies, protein-protein interactions, drug discovery, computational biology, therapeutics, machine learning, protein sequence modeling, binding affinity prediction, antibody design
- Task category: text classification
- License: other
- License details: non-commercial research use only; commercial inquiries should be directed to NaturalAntibody
- Language: English

Provider:

introvoyz041
