melanierb/ProteinSpace-TheraSAbDab-mAbs

Name: melanierb/ProteinSpace-TheraSAbDab-mAbs
Creator: melanierb
Published: 2026-04-19 20:21:34
License: No description available

Source: Hugging Face. Updated 2026-04-19; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/melanierb/ProteinSpace-TheraSAbDab-mAbs

Resource description:

---
configs:
- config_name: heavy_chain
  data_files:
  - split: train
    path: data/heavy_chain.parquet
- config_name: light_chain
  data_files:
  - split: train
    path: data/light_chain.parquet
---

# ProteinSpace-TheraSAbDab-mAbs

## Dataset Description

ProteinSpace-TheraSAbDab-mAbs is a curated collection of therapeutic antibody sequences derived from the Thera-SAbDab database. The dataset contains 1,716 antibody chain sequences (858 heavy chains and 858 light chains) from 858 therapeutic monoclonal antibodies (mAbs).

Each sequence has been processed with ANARCI (Antibody Numbering and Receptor ClassIfication) to provide IMGT-numbered positions, region boundaries (framework and CDR regions), and germline gene assignments. The heavy chain and light chain sequences are provided as separate dataset configurations.

## Dataset Summary

- **Total sequences:** 1,716 (858 heavy chains + 858 light chains)
- **Therapeutic antibodies:** 858
- **Format:** Whole mAb and Whole mAb ADC only
- **Genetics:** Genetically human, Humanized, Chimeric/Humanized, Chimeric, Murine
- **Numbering scheme:** IMGT
- **Regions annotated:** FWR1, CDR1, FWR2, CDR2, FWR3, CDR3, FWR4

## Dataset Structure

### Configurations

This dataset has two configurations:

- `heavy_chain`: 858 heavy chain sequences
- `light_chain`: 858 light chain sequences

```python
from datasets import load_dataset

# Each configuration is loaded separately by name.
heavy = load_dataset("melanierb/ProteinSpace-TheraSAbDab-mAbs", "heavy_chain")
light = load_dataset("melanierb/ProteinSpace-TheraSAbDab-mAbs", "light_chain")
```

### Data Fields

- `sequence` (string): The amino acid sequence of the antibody chain (variable domain only, extracted via ANARCI)
- `locus` (string): Chain type: "H" for heavy chain, "K" for kappa light chain, "L" for lambda light chain
- `therapeutic` (string): Name of the therapeutic antibody
- `genetics` (string): Genetic origin: "Genetically human", "Humanised", "Chimeric/Humanized", "Chimeric", or "Murine"
- `anarci_numbers` (list of integers): IMGT position numbers for each residue in the sequence
- `region_ends` (list of 7 integers): Cumulative end positions for [FWR1, CDR1, FWR2, CDR2, FWR3, CDR3, FWR4]
- `v_genes` (string): Assigned V gene from germline analysis
- `d_genes` (string): Assigned D gene from germline analysis (heavy chains only)
- `j_genes` (string): Assigned J gene from germline analysis

### Data Splits

Each configuration is published as a single `train` split; no validation or test splits are predefined. Users should create their own train/validation/test splits based on their specific use case. We recommend splitting at the therapeutic antibody level (not individual chains) to avoid data leakage.

## Dataset Creation

### Source Data

The dataset is derived from Thera-SAbDab (Therapeutic Structural Antibody Database), available at: https://opig.stats.ox.ac.uk/webapps/newsabdab/therasabdab/

Thera-SAbDab is a curated database maintained by the Oxford Protein Informatics Group containing structural and sequence information for therapeutic antibodies.

### Data Collection and Processing

The processing pipeline consisted of the following steps:

1. **Initial data:** Downloaded from Thera-SAbDab (1,133 therapeutic antibodies)
2. **Filtering criteria:**
   - Format: Only "Whole mAb" and "Whole mAb ADC"
   - Result: 858 therapeutic antibodies
3. **Chain separation:**
   - Heavy chains and light chains separated into individual records and stored as separate dataset configurations
   - Light chain locus mapped: Kappa → "K", Lambda → "L"
4. **ANARCI processing:**
   - IMGT numbering scheme applied to all sequences
   - Variable domain boundaries identified and extracted
   - CDR and framework regions annotated based on IMGT positions
   - Germline V, D, and J gene assignments obtained
5. **Region boundary calculation** based on IMGT region definitions:
   - Heavy chains: FWR1 (1–26), CDR1 (27–38), FWR2 (39–55), CDR2 (56–65), FWR3 (66–104), CDR3 (105–117), FWR4 (118–128)
   - Light chains: FWR1 (1–26), CDR1 (27–38), FWR2 (39–55), CDR2 (56–65), FWR3 (66–104), CDR3 (105–117), FWR4 (118–127)

### Annotations

All annotations are computationally derived using ANARCI:

- **IMGT numbering:** Automated using ANARCI with the IMGT scheme
- **Germline genes:** Assigned by ANARCI's built-in germline assignment functionality
- **Region boundaries:** Calculated based on IMGT position definitions

No manual annotation was performed.

## Considerations for Using the Data

### Limitations

- The dataset contains only whole mAbs and whole mAb ADCs; other antibody formats (Fabs, scFvs, bispecifics, etc.) are excluded
- Sequences are limited to the variable domain as identified by ANARCI; constant regions are not included
- A small number of sequences may have failed ANARCI processing and are excluded from the final dataset
- Germline gene assignments are computational predictions and may not reflect the actual genetic origin in all cases
- The dataset represents therapeutic antibodies as documented in Thera-SAbDab up to the download date (25.01.2026) and may not include the most recent therapeutics

## Additional Information

### Licensing

Please refer to the Thera-SAbDab terms of use: https://opig.stats.ox.ac.uk/webapps/newsabdab/therasabdab/

This dataset is derived from publicly available therapeutic antibody sequences. Users should comply with all applicable licenses and terms of use from the original data source.
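
### Example: Using `region_ends` and Antibody-Level Splits

The following is a minimal sketch, not part of the original pipeline, of how the `region_ends` field and the antibody-level splitting recommended under Data Splits might be used. It assumes that each entry of `region_ends` is a cumulative residue count, so that it works as a 0-based exclusive slice end into `sequence`; the card does not state the indexing convention, so this should be checked against a real record first. The helper names `split_regions` and `split_by_therapeutic` and the toy record are hypothetical.

```python
import random

REGION_NAMES = ["FWR1", "CDR1", "FWR2", "CDR2", "FWR3", "CDR3", "FWR4"]


def split_regions(sequence, region_ends):
    """Slice a variable-domain sequence into its seven annotated regions.

    Assumption: region_ends[i] is the 0-based exclusive end index of
    region i within `sequence` (a cumulative residue count).
    """
    if len(region_ends) != len(REGION_NAMES):
        raise ValueError("expected 7 region end positions")
    regions = {}
    start = 0
    for name, end in zip(REGION_NAMES, region_ends):
        regions[name] = sequence[start:end]
        start = end
    return regions


def split_by_therapeutic(therapeutics, test_fraction=0.2, seed=0):
    """Partition antibody names into disjoint train and test sets.

    Splitting on the `therapeutic` name keeps the heavy and light chain
    of one antibody on the same side of the split.
    """
    names = sorted(set(therapeutics))  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(names)
    n_test = max(1, round(len(names) * test_fraction))
    return set(names[n_test:]), set(names[:n_test])


# Toy record (made-up values, not taken from the dataset).
toy = {"sequence": "AAABBCCCCDDEEEFFG", "region_ends": [3, 5, 9, 11, 14, 16, 17]}
print(split_regions(toy["sequence"], toy["region_ends"])["CDR3"])  # FF
```

With the real data, the names could be taken from `heavy["train"]["therapeutic"]`, and the resulting sets applied to both configurations, for example via `Dataset.filter(lambda r: r["therapeutic"] in train_names)`, so that both chains of an antibody stay together.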
