NeoMHCI Training and Evaluation Data
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/kmt8tx7gh6
Description:
All training and evaluation data used in the NeoMHCI study.
All code is freely available at https://github.com/ZhuLab-Fudan/NeoMHCI.
------------------------------------------------------
Train Data:
1. EL_train.zip:
Contains all eluted ligand data used for training the five-fold cross-validation model.
- EL2020_A.txt: Corresponds to the EL2020_A dataset mentioned in the article.
- EL2020_B.txt: Corresponds to the EL2020_B dataset mentioned in the article.
Combining the data from these two files yields the EL2020_C dataset mentioned in the article.
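As a minimal sketch of that merge, assuming both files are plain text with one record per line, no header row, and the same column layout (the file format is not described here, so check the actual files first), the two files could be concatenated as follows. The function name `merge_el_files` and the output file name are illustrative and not part of the dataset.

```python
def merge_el_files(path_a, path_b, path_out):
    """Concatenate two record files into one, skipping blank lines.

    Assumption: one record per line and no header row in either file.
    Returns the number of records written.
    """
    count = 0
    with open(path_out, "w", encoding="utf-8") as out:
        for path in (path_a, path_b):
            with open(path, encoding="utf-8") as src:
                for line in src:
                    record = line.rstrip("\r\n")
                    if record.strip():  # ignore empty lines
                        out.write(record + "\n")
                        count += 1
    return count
```

A call such as `merge_el_files("EL2020_A.txt", "EL2020_B.txt", "EL2020_C.txt")` would then write the combined records to a new file without modifying the originals.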
2. NE_train.zip:
Contains all neoepitope data used during the fine-tuning process for the immunogenicity prediction task.
- NE2023.txt: Corresponds to the NE2023 dataset mentioned in the article.
- NE2023_list.json: A candidate pool constructed from all wild-type sequences in NE2023.
- IN2023: Corresponds to the IN2023 dataset mentioned in the article, used as a validation set during the fine-tuning process.
------------------------------------------------------
Test Data:
3. EL_test.zip:
Contains the test set for ligand presentation prediction, along with the prediction scores of NeoMHCI and other comparison methods.
- IM2020.csv: Corresponds to the IM2020 test set mentioned in the article.
- IS2020.csv: Corresponds to the IS2020 test set mentioned in the article.
4. NE_test.zip:
Contains the neoepitope test set for immunogenicity prediction, along with the prediction scores of NeoMHCI and other comparison methods.
- BM2023.csv: Corresponds to the BM2023 test set mentioned in the article.
- COVID-19.txt: Corresponds to the COVID-19 test set mentioned in the article.
- PM2018_data.txt: Corresponds to the PM2018 test set mentioned in the article. Its fields are `mutation_id` (mutation number), `patient_id` (patient number), `epitope` (whether the mutation is immunogenic), `tpm` (gene expression level of the mutation), `cell_line` (the renamed multi-allele combination of the patient, with the specific correspondence given in PM2018_allelelist), and `pepseq` (the specific sequence of the mutation). Each mutation is represented by all 8- to 11-mer slices that contain the mutation site, and the highest prediction value among those slices is taken as the prediction score for that mutation.
- PM2018_records.csv: Records the prediction scores of each method for every mutation with TPM>0.
- PM2018_allelelist: Records the multi-allele combinations expressed by each patient in PM2018.
- RV2023_data.txt, RV2023_records.txt, RV2023_allelelist: Same format and purpose as the corresponding PM2018 files.
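The per-mutation scoring described for PM2018 (all 8- to 11-mer slices covering the mutation site, scored by their maximum) can be sketched as below. This is an illustration under stated assumptions, not the code used in the study: how the mutation position is encoded in `pepseq` is not stated here, so `mut_pos` is taken to be a 0-based index supplied by the caller, and `predict` stands in for any peptide-level scoring function.

```python
def slices_containing(seq, mut_pos, min_len=8, max_len=11):
    """Return every substring of length min_len..max_len covering index mut_pos.

    mut_pos is a 0-based index into seq (an assumption of this sketch).
    """
    slices = []
    for length in range(min_len, max_len + 1):
        # A window [start, start + length) covers mut_pos when
        # start <= mut_pos <= start + length - 1, and must fit inside seq.
        first = max(0, mut_pos - length + 1)
        last = min(mut_pos, len(seq) - length)
        for start in range(first, last + 1):
            slices.append(seq[start:start + length])
    return slices


def mutation_score(seq, mut_pos, predict):
    """Score a mutation as the maximum prediction over all of its slices."""
    candidates = slices_containing(seq, mut_pos)
    if not candidates:
        raise ValueError("sequence too short to yield any 8- to 11-mer slice")
    return max(predict(peptide) for peptide in candidates)
```

With at least 10 residues on each side of the mutation, this produces 8 + 9 + 10 + 11 = 38 slices per mutation; fewer are produced when the site lies near either end of the sequence.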
------------------------------------------------------
Common:
- allelelist: Maps each multi-allele combination name (cell line) used in the multi-allele (MA) data to the specific MHC-I molecules it comprises.
- MHC_pseudo.dat: Records the 34-mer pseudo sequences of MHC-I molecules.
- eval.py: Evaluation script used to compile various metrics from the records of each test set.
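For readers who want to load the pseudo sequences, the following is a minimal sketch that assumes each non-empty line of MHC_pseudo.dat holds an allele name and its pseudo sequence separated by whitespace. That layout is an assumption, not something stated in this description, and `parse_pseudo_sequences` is a hypothetical helper name. The length check reflects the 34-mer property mentioned above.

```python
def parse_pseudo_sequences(lines, expected_len=34):
    """Build a dict mapping allele name -> pseudo sequence.

    Assumption: each non-empty line is '<allele_name> <pseudo_sequence>'
    separated by whitespace. Lines that do not fit raise ValueError.
    """
    table = {}
    for lineno, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2:
            raise ValueError(f"line {lineno}: expected 2 columns, got {len(parts)}")
        name, pseudo = parts
        if len(pseudo) != expected_len:
            raise ValueError(
                f"line {lineno}: pseudo sequence has length {len(pseudo)}, "
                f"expected {expected_len}"
            )
        table[name] = pseudo
    return table
```

Because an open file object iterates over its lines, it can be passed directly, for example `with open("MHC_pseudo.dat") as fh: table = parse_pseudo_sequences(fh)`.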
Created:
2024-10-14



