YYJMAY/pmt-interaction

Hugging Face · Updated 2025-11-18 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/YYJMAY/pmt-interaction
Resource description:
---
license: mit
task_categories:
- text-classification
tags:
- tcr
- tcr-pmhc
- peptide
- mhc
- immunology
- binding-prediction
- pmt
size_categories:
- 100K<n<1M
---

# PMT Benchmark Dataset

## Dataset Description

The PMT (Peptide-MHC-TCR) benchmark dataset is intended for training and evaluating TCR-pMHC binding prediction models. It contains TCR CDR3 sequences, peptide antigens, HLA alleles, and binary binding labels.

### Dataset Summary

This is the official PMT training and in-distribution (ID) test set from the SPRINT framework. The data has been cleaned, deduplicated, and standardized for reproducibility.

- **Training Set**: 474,881 samples
- **ID Test Set**: 4,564 samples
- **Task**: Binary classification (TCR-pMHC binding prediction)
- **Modality**: TCR CDR3 + Peptide + MHC (PMT task)

## Dataset Structure

### Data Files

- `train.csv`: Training data (474,881 samples)
- `id_test.csv`: In-distribution test data (4,564 samples)

### Data Format

CSV files with the following columns:

| Column | Type | Description |
|--------|------|-------------|
| CDR3 | string | TCR CDR3beta amino acid sequence |
| peptide | string | Peptide antigen sequence (8-15 aa) |
| HLA | string | HLA allele (standardized format: A*02:01) |
| label | int | Binding label (1=binder, 0=non-binder) |
| HLA_sequence | string | HLA pseudo-sequence (optional) |

### Dataset Statistics

#### Training Set

- **Total Samples**: 474,881
- **Positive Samples**: 33,129 (7.0%)
- **Negative Samples**: 441,752 (93.0%)
- **Unique HLAs**: 78
- **Unique Peptides**: 638
- **Unique TCRs**: 32,853

#### ID Test Set

- **Total Samples**: 4,564
- **Positive Samples**: 321 (7.0%)
- **Negative Samples**: 4,243 (93.0%)
- **Unique HLAs**: 12
- **Unique Peptides**: 190
- **Unique TCRs**: 1,283

## Usage

### Load with Hugging Face Datasets

```python
from datasets import load_dataset

# Load training data
dataset = load_dataset("YYJMAY/pmt-interaction", split="train")
train_df = dataset.to_pandas()

# Load test data
dataset = load_dataset("YYJMAY/pmt-interaction", split="test")
test_df = dataset.to_pandas()
```

### Load with Pandas

```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Download training file
train_path = hf_hub_download(
    repo_id="YYJMAY/pmt-interaction",
    filename="train.csv",
    repo_type="dataset"
)
train_df = pd.read_csv(train_path)

# Download test file
test_path = hf_hub_download(
    repo_id="YYJMAY/pmt-interaction",
    filename="id_test.csv",
    repo_type="dataset"
)
test_df = pd.read_csv(test_path)
```

### Use with SPRINT Framework

The SPRINT framework automatically downloads and uses this dataset:

```bash
python scripts/run_benchmark.py --method METHOD --dataset pmt --mode train
python scripts/run_benchmark.py --method METHOD --dataset pmt --mode eval
```

## Data Quality

### Preprocessing

- **Deduplication**: All duplicate entries removed based on (CDR3, peptide, HLA, label)
- **HLA Standardization**: All HLA alleles normalized to standard format (e.g., A*02:01)
- **Missing Values**: No missing values in required columns
- **Label Validation**: All labels are binary (0 or 1)

### Peptide Length Distribution

- Training set peptide lengths: 8-15 amino acids
- Test set peptide lengths: 8-15 amino acids

## Construction

This dataset was curated and cleaned as part of the SPRINT benchmarking framework:

1. Collected from multiple public TCR-pMHC datasets
2. Standardized HLA allele naming conventions
3. Removed duplicates and incomplete entries
4. Split into training and in-distribution test sets
5. Validated for data quality and consistency

## Tasks

This dataset is designed for:

- **PMT (Peptide-MHC-TCR) Task**: Predict TCR-pMHC binding using all three components
- **Binary Classification**: Classify as binder (1) or non-binder (0)
- **Model Benchmarking**: Evaluate model performance on standardized data

## Limitations

- Only includes class I MHC (HLA-A, HLA-B, HLA-C)
- Limited to TCR CDR3beta sequences
- Binary labels (no binding affinity values)
- Peptide length range: 8-15 amino acids

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{pmt_benchmark_2024,
  title={PMT Benchmark Dataset for TCR-pMHC Binding Prediction},
  author={SPRINT Framework Contributors},
  year={2024},
  url={https://huggingface.co/datasets/YYJMAY/pmt-interaction}
}
```

## License

MIT License

## Contact

For questions or issues, please open an issue in the SPRINT repository.

## Related Datasets

- Allelic OOD: YYJMAY/allelic-ood
- Temporal OOD: YYJMAY/temporal-ood
- Modality OOD: YYJMAY/modality-ood
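
## Example: Checking the Documented Invariants

The properties stated under Data Quality (no duplicates on (CDR3, peptide, HLA, label), standardized HLA names, no missing values in required columns, binary labels) and the documented 8-15 aa peptide range can be sketched as a small pandas helper. This is an illustrative sketch, not the SPRINT framework's own validation code: the function name `check_pmt_frame`, the HLA regular expression, and the toy rows are assumptions made for this example and are not taken from the dataset.

```python
import pandas as pd

REQUIRED_COLUMNS = ["CDR3", "peptide", "HLA", "label"]

# Assumed pattern for the "A*02:01" style named in the card; the dataset's
# exact standardization rules are not documented here.
HLA_PATTERN = r"[ABC]\*\d{2,3}:\d{2,3}"


def check_pmt_frame(df: pd.DataFrame) -> dict:
    """Count violations of the documented invariants in a PMT-style frame."""
    missing_cols = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing_cols:
        raise ValueError(f"missing required columns: {missing_cols}")

    required = df[REQUIRED_COLUMNS]
    peptide_len = required["peptide"].str.len()
    hla_ok = required["HLA"].str.fullmatch(HLA_PATTERN, na=False)

    return {
        "rows": len(df),
        # Rows with a missing value in any required column.
        "missing_values": int(required.isna().any(axis=1).sum()),
        # Repeats of an earlier (CDR3, peptide, HLA, label) combination.
        "duplicate_rows": int(required.duplicated().sum()),
        "non_binary_labels": int((~required["label"].isin([0, 1])).sum()),
        # Lengths outside the inclusive 8-15 range (missing peptides count too).
        "peptide_length_out_of_range": int((~peptide_len.between(8, 15)).sum()),
        "nonstandard_hla": int((~hla_ok).sum()),
        "positive_fraction": float((required["label"] == 1).mean()),
    }


# Toy rows for illustration only (not drawn from train.csv or id_test.csv).
# The second row repeats the first, and the third uses an "HLA-" prefix.
toy = pd.DataFrame(
    {
        "CDR3": ["CASSIRSSYEQYF", "CASSIRSSYEQYF", "CASSLGQAYEQYF", "CASSPGTGELFF"],
        "peptide": ["GILGFVFTL", "GILGFVFTL", "NLVPMVATV", "ELAGIGILTV"],
        "HLA": ["A*02:01", "A*02:01", "HLA-A*02:01", "A*02:01"],
        "label": [1, 1, 0, 0],
    }
)

report = check_pmt_frame(toy)
print(report)
```

The same helper could be applied to `train_df` or `test_df` from the Usage section; on the published files, every violation count would be expected to be zero if the card's data-quality claims hold, and `positive_fraction` should be close to the 7.0% reported in the statistics.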
Provider:
YYJMAY