NeoMHCI Training and Evaluation Data
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://data.mendeley.com/datasets/kmt8tx7gh6
Description:
All training and evaluation data used in the NeoMHCI study.
All code is freely available at https://github.com/ZhuLab-Fudan/NeoMHCI.
------------------------------------------------------
Train Data:
1. EL_train.zip
Contains all eluted ligand data used for training the five-fold cross-validation model.
- EL2020_A.txt: Corresponds to the EL2020_A dataset mentioned in the article.
- EL2020_B.txt: Corresponds to the EL2020_B dataset mentioned in the article.
Combining the data from these two files constitutes the EL2020_C dataset mentioned in the article.
2. NE_train.zip
Contains all neoepitope data used during the fine-tuning process for the immunogenicity prediction task.
- NE2023.txt: Corresponds to the NE2023 dataset mentioned in the article.
- NE2023_list.json: A candidate pool constructed from all wild-type sequences in NE2023.
- IN2023: Corresponds to the IN2023 dataset mentioned in the article, used as a validation set during the fine-tuning process.
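
As a minimal sketch of how EL2020_C could be assembled from the two EL files, the snippet below concatenates them record by record. It assumes that both files are plain text with one record per line and no header row, which the description does not state. The function name `combine_datasets`, the `EL_train/` directory, and the output file name `EL2020_C.txt` are hypothetical.

```python
from pathlib import Path


def combine_datasets(paths, out_path):
    """Concatenate text files record by record; return the number of records written."""
    n_written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    record = line.rstrip("\r\n")
                    if not record.strip():
                        continue  # skip blank lines
                    out.write(record + "\n")
                    n_written += 1
    return n_written


if __name__ == "__main__":
    # Assumed layout: EL_train.zip extracted into ./EL_train/
    data_dir = Path("EL_train")
    sources = [data_dir / "EL2020_A.txt", data_dir / "EL2020_B.txt"]
    if all(p.exists() for p in sources):
        n = combine_datasets(sources, data_dir / "EL2020_C.txt")
        print(f"wrote {n} records")
```

If the files turn out to carry a header row, the second header would need to be dropped before writing.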
------------------------------------------------------
Test Data:
3. EL_test.zip
Contains the test set for ligand presentation prediction, along with the prediction scores of NeoMHCI and other comparison methods.
- IM2020.csv: Corresponds to the IM2020 test set mentioned in the article.
- IS2020.csv: Corresponds to the IS2020 test set mentioned in the article.
4. NE_test.zip
Contains the neoepitope test set for immunogenicity prediction, along with the prediction scores of NeoMHCI and other comparison methods.
- BM2023.csv: Corresponds to the BM2023 test set mentioned in the article.
- COVID-19.txt: Corresponds to the COVID-19 test set mentioned in the article.
- PM2018_data.txt: Corresponds to the PM2018 test set mentioned in the article. Columns: `mutation_id` (mutation number), `patient_id` (patient number), `epitope` (whether the mutation is immunogenic), `tpm` (gene expression level of the mutation), `cell_line` (the renamed multi-allele combination of the patient; the specific correspondence is given in PM2018_allelelist), and `pepseq` (the specific sequence of the mutation). Each mutation is represented by all 8- to 11-mer slices that contain the mutation site, and the highest prediction value among these slices is taken as the prediction score for that mutation.
- PM2018_records.csv: Records the prediction scores of each method for every mutation with TPM>0.
- PM2018_allelelist: Records the multi-allele combinations expressed by each patient in PM2018.
- RV2023_data.txt, RV2023_records.txt, RV2023_allelelist: Same format as the corresponding PM2018 files.
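
The per-mutation scoring described for PM2018 (all 8- to 11-mer slices containing the mutation site, with the maximum prediction taken as the score) can be sketched as follows. This is an illustration, not the code used in the study: how the mutation site is encoded in `pepseq` is not specified here, so the 0-based index `mut_pos` is an assumption, and `predict` is a hypothetical stand-in for any peptide-level scoring function.

```python
def slices_containing(pepseq, mut_pos, min_len=8, max_len=11):
    """Return all substrings of length min_len..max_len that cover index mut_pos (0-based)."""
    slices = []
    for k in range(min_len, max_len + 1):
        first = max(0, mut_pos - k + 1)          # earliest start that still covers mut_pos
        last = min(mut_pos, len(pepseq) - k)     # latest start that fits inside pepseq
        for start in range(first, last + 1):
            slices.append(pepseq[start:start + k])
    return slices


def mutation_score(pepseq, mut_pos, predict):
    """Score a mutation as the maximum prediction over all of its slices."""
    candidates = slices_containing(pepseq, mut_pos)
    if not candidates:
        raise ValueError("sequence too short to yield any 8-11mer slice")
    return max(predict(peptide) for peptide in candidates)


if __name__ == "__main__":
    # Toy sequence with the (assumed) mutated residue "W" at index 10.
    seq = "A" * 10 + "W" + "A" * 10
    print(len(slices_containing(seq, 10)))
    # Hypothetical predictor: peptide length, used only to exercise the function.
    print(mutation_score(seq, 10, predict=len))
```

With a mutation flanked by at least 10 residues on each side, every window length k contributes k slices, so the number of candidates per mutation stays small.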
------------------------------------------------------
Common:
- allelelist: Records the specific MHC-I molecule combination that corresponds to each multi-allele combination name (cell line) used in the multi-allele (MA) data.
- MHC_pseudo.dat: Records the 34-mer pseudo sequences of MHC-I molecules.
- eval.py: Evaluation script used to compile various metrics from the records of each test set.
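
A minimal loader for MHC_pseudo.dat might look like the sketch below. The file layout is an assumption: one allele per line, with the allele name and its pseudo sequence separated by whitespace. The function name `load_pseudo_sequences` is hypothetical. The length check reflects the 34-mer pseudo sequences described above.

```python
def load_pseudo_sequences(path, expected_len=34):
    """Read 'allele pseudo_sequence' pairs into a dict, checking the sequence length."""
    pseudo = {}
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if len(fields) != 2:
                raise ValueError(f"line {line_no}: expected 2 fields, got {len(fields)}")
            allele, seq = fields
            if len(seq) != expected_len:
                raise ValueError(
                    f"line {line_no}: {allele} has length {len(seq)}, expected {expected_len}"
                )
            pseudo[allele] = seq
    return pseudo
```

Failing loudly on an unexpected field count or sequence length makes it easier to notice if the real file deviates from the assumed layout.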
Created:
2024-10-14



