YYJMAY/modality-ood

Name: YYJMAY/modality-ood
Creator: YYJMAY
Published: 2025-11-18 08:42:21
License: MIT

Source: Hugging Face | Updated: 2025-11-18 | Indexed: 2025-12-20

Download link:

https://hf-mirror.com/datasets/YYJMAY/modality-ood


Description:

---
license: mit
task_categories:
- text-classification
tags:
- mhc
- peptide
- immunology
- out-of-distribution
- modality-shift
- binding-affinity
- eluted-ligand
size_categories:
- 1M<n<10M
---

# Modality OOD Dataset

## Dataset Description

The Modality OOD dataset tests model generalization across **different data modalities** in peptide-MHC (pMHC) binding prediction. It contains two complementary datasets representing distinct experimental measurement types:

- **BA (Binding Affinity)**: In vitro binding affinity measurements with continuous values
- **EL (Eluted Ligand)**: Mass spectrometry-based eluted ligand data with binary labels

### Key Features

- **Modality Shift Testing**: Evaluates if models trained on one modality (e.g., BA) can generalize to another (e.g., EL)
- **Real-World Relevance**: Reflects the practical challenge of applying models across different experimental platforms
- **Large Scale**: Combined 3.85M samples across 130+ HLA alleles
- **Single Allele Format**: Each sample has one peptide-HLA pair (no multi-allele)

### Biological Significance

**Why Two Modalities Matter:**

1. **Binding Affinity (BA)**:
   - Measures peptide-MHC binding strength in controlled conditions
   - Continuous scale (0 = no binding, 1 = strong binding)
   - Reflects thermodynamic stability
   - Common in immunoinformatics training data

2. **Eluted Ligand (EL)**:
   - Peptides naturally presented on cell surface MHC molecules
   - Binary label (1 = naturally presented, 0 = not presented)
   - Reflects cellular processing, TAP transport, and MHC loading
   - More biologically relevant but harder to obtain

**The Modality Gap:**

Models trained on BA data often fail on EL data (and vice versa) because:

- BA measures binding only, EL captures the full antigen processing pathway
- Different experimental biases and noise profiles
- Label semantics differ (affinity vs. presentation)

This dataset enables testing cross-modality generalization.
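Because the two label types differ, one practical way to score BA → EL transfer is a ranking metric such as ROC AUC, which compares continuous model outputs directly against binary EL labels without choosing a threshold. The following is a minimal, dependency-free sketch of that idea. The function name `roc_auc` and the toy values are illustrative assumptions and are not part of the dataset or the SPRINT framework.

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation, with tie handling."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    # Sort indices by score, then give tied scores their average (1-based) rank.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        avg_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1

    # U statistic of the positive class, normalised by the number of pos/neg pairs.
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Toy values only: binary EL labels and hypothetical outputs of a BA-trained model.
el_labels = [1, 0, 0, 1, 0, 0]
ba_scores = [0.81, 0.12, 0.40, 0.35, 0.05, 0.35]
auc = roc_auc(el_labels, ba_scores)
print(f"BA -> EL ROC AUC on toy data: {auc:.4f}")
```

If scikit-learn is available, `sklearn.metrics.roc_auc_score(el_labels, ba_scores)` should give the same value. Given that only about 5.4% of EL samples are positive, a precision-recall based metric may be worth reporting alongside AUC.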
## Dataset Structure

### Files

- **ba_s.csv**: Binding Affinity dataset (single allele)
- **el_s.csv**: Eluted Ligand dataset (single allele)

### Data Format

Both files share the same schema:

| Column | Type | Description | Required |
|--------|------|-------------|----------|
| peptide | string | Peptide amino acid sequence (8-15 aa) | Yes |
| HLA | string | HLA allele (e.g., HLA-A02:01, HLA-B07:02) | Yes |
| label | float/int | BA: continuous 0-1, EL: binary 0/1 | Yes |
| HLA_sequence | string | HLA pseudo-sequence | Yes |

### Dataset Statistics

#### BA (Binding Affinity)

- **Total Samples**: 170,470
- **Label Type**: Continuous (0.0 to 1.0)
- **Mean Affinity**: 0.2547
- **Median Affinity**: 0.0847
- **Unique HLA Alleles**: 111
- **Peptide Lengths**: 8-14 amino acids (74.3% are 9-mers)
- **File Size**: 10.61 MB

#### EL (Eluted Ligand)

- **Total Samples**: 3,679,405
- **Label Type**: Binary classification
- **Positive Samples**: 197,547 (5.4%)
- **Negative Samples**: 3,481,858 (94.6%)
- **Unique HLA Alleles**: 130
- **Peptide Lengths**: 8-15 amino acids (distributed across all lengths)
- **File Size**: 213.35 MB

### Combined Statistics

- **Total Samples**: 3,849,875
- **Unique HLA Coverage**: 130+ alleles across HLA-A, B, C
- **Modalities**: 2 (BA and EL)
- **Task Type**: Peptide-MHC (PM) binding prediction

## Usage

### Load with Pandas

```python
from huggingface_hub import hf_hub_download
import pandas as pd

# Download BA dataset
ba_file = hf_hub_download(
    repo_id="YYJMAY/modality-ood",
    filename="ba_s.csv",
    repo_type="dataset"
)
ba_df = pd.read_csv(ba_file)

# Download EL dataset
el_file = hf_hub_download(
    repo_id="YYJMAY/modality-ood",
    filename="el_s.csv",
    repo_type="dataset"
)
el_df = pd.read_csv(el_file)
```

### Use with SPRINT Framework

```python
from sprint.core.dataset_manager import DatasetManager

manager = DatasetManager()
config = {
    'hf_repo': 'YYJMAY/modality-ood',
    'files': ['ba_s.csv', 'el_s.csv'],
    'ba': 'ba_s.csv',
    'el': 'el_s.csv'
}
files = manager.get_dataset('modality_ood', config)

ba_file = files['ba']
el_file = files['el']
```

### Example: Cross-Modality Evaluation

```python
import pandas as pd
from your_model import YourModel  # placeholder: substitute your own model class

# Load data (ba_file and el_file are the local paths obtained in either loading step above)
ba_df = pd.read_csv(ba_file)
el_df = pd.read_csv(el_file)

# Scenario 1: Train on BA, test on EL
model = YourModel()
model.train(ba_df)
el_predictions = model.predict(el_df)

# Scenario 2: Train on EL, test on BA (fresh model, so nothing carries over from Scenario 1)
model = YourModel()
model.train(el_df)
ba_predictions = model.predict(ba_df)

# Evaluate cross-modality generalization
```

## Experimental Design

### Recommended Evaluation Scenarios

1. **BA → EL Generalization**
   - Train on BA (continuous labels)
   - Test on EL (binary labels)
   - Measures if affinity-based models predict presentation

2. **EL → BA Generalization**
   - Train on EL (binary labels)
   - Test on BA (continuous labels)
   - Measures if presentation-based models predict affinity

3. **Mixed Training**
   - Train on both BA and EL
   - Test separately on each
   - Measures multi-task learning benefits

4. **Modality-Specific Training**
   - Train and test on same modality
   - Baseline for comparison

### Metrics Considerations

- **For BA**: Use regression metrics (MSE, MAE, Pearson correlation)
- **For EL**: Use classification metrics (AUC, F1, precision, recall)
- **Cross-modal**: May need to binarize BA predictions or convert EL to scores

## Construction Method

Both datasets were constructed to ensure:

1. **Single Allele Format**: Each sample has exactly one HLA allele
2. **Quality Control**:
   - No missing values in required columns
   - No duplicate peptide-HLA-label combinations
   - Peptide lengths filtered to 8-15 amino acids
3. **Standardized HLA Format**: HLA-A02:01 format (with hyphen prefix)
4. **Representative Coverage**: 130+ HLA alleles across major supertypes
5. **Balanced Lengths**: Both datasets include diverse peptide lengths

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{modality_ood_2024,
  title={Modality OOD Dataset for Peptide-MHC Binding Prediction},
  author={SPRINT Framework Contributors},
  year={2024},
  url={https://huggingface.co/datasets/YYJMAY/modality-ood}
}
```

## Related Datasets

- **Allelic OOD**: Tests generalization to rare HLA alleles
- **Temporal OOD**: Tests generalization to new data over time

## Notes

- **No CDR3 sequences**: These datasets are for PM (Peptide-MHC) tasks only, not PMT (Peptide-MHC-TCR)
- **Label semantics differ**: BA is continuous affinity, EL is binary presentation
- **Experimental platforms differ**: BA from in vitro assays, EL from mass spectrometry
- **Biological processes differ**: BA measures binding only, EL captures full pathway

## License

MIT License

## Contact

For questions or issues, please open an issue on the dataset repository.

---

**Keywords**: peptide-MHC binding, immunology, binding affinity, eluted ligand, modality shift, out-of-distribution, generalization, cross-modal learning
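The quality-control rules listed under Construction Method (no missing required values, peptide length 8-15, no duplicate peptide-HLA-label combinations) and the label ranges from the Data Format table can be re-checked locally after download. The sketch below uses only the standard library. The helper name `find_problems`, the `binary_labels` flag, and the toy rows (including the `PSEUDOSEQ` placeholder) are assumptions for illustration and do not reflect how the authors built or validated the files.

```python
import csv
import io

REQUIRED_COLUMNS = ("peptide", "HLA", "label", "HLA_sequence")


def find_problems(rows, binary_labels):
    """Return (row_index, message) pairs for rows that break the stated rules."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # Rule: no missing values in required columns.
        missing = [c for c in REQUIRED_COLUMNS if not (row.get(c) or "").strip()]
        if missing:
            problems.append((i, "missing " + ",".join(missing)))
            continue
        # Rule: peptide lengths filtered to 8-15 amino acids.
        if not 8 <= len(row["peptide"]) <= 15:
            problems.append((i, "peptide length outside 8-15"))
        try:
            label = float(row["label"])
        except ValueError:
            problems.append((i, "label is not numeric"))
            continue
        # Label range: EL is binary 0/1, BA is continuous in [0, 1].
        if binary_labels and label not in (0.0, 1.0):
            problems.append((i, "EL label is not 0/1"))
        elif not 0.0 <= label <= 1.0:
            problems.append((i, "label outside 0-1"))
        # Rule: no duplicate peptide-HLA-label combinations.
        key = (row["peptide"], row["HLA"], label)
        if key in seen:
            problems.append((i, "duplicate peptide-HLA-label"))
        seen.add(key)
    return problems


# Made-up rows that deliberately violate the rules; not real dataset entries.
toy_csv = """peptide,HLA,label,HLA_sequence
ACDEFGHIK,HLA-A02:01,1,PSEUDOSEQ
ACDEFGHIK,HLA-A02:01,1,PSEUDOSEQ
AAA,HLA-B07:02,0,PSEUDOSEQ
LMNPQRSTV,HLA-A02:01,0.5,PSEUDOSEQ
"""
rows = list(csv.DictReader(io.StringIO(toy_csv)))
problems_as_el = find_problems(rows, binary_labels=True)
problems_as_ba = find_problems(rows, binary_labels=False)
print(problems_as_el)
```

For the real files, `find_problems` accepts any iterable of dict rows, so a `csv.DictReader` opened on the downloaded `el_s.csv` can be passed directly instead of building a list of 3.6M rows first; the `seen` set still grows with the number of distinct rows.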

Provider:

YYJMAY
