NeoMHCI Training and Evaluation Data
Source: Mendeley Data; updated 2024-06-25; indexed 2024-06-26
Download link:
https://data.mendeley.com/datasets/kmt8tx7gh6
Description:
All training and evaluation data used in the NeoMHCI study. All code is freely available at https://github.com/ZhuLab-Fudan/NeoMHCI.
------------------------------------------------------
Train Data:
1. EL_train.zip: Contains all eluted ligand data used for training the five-fold cross-validation model.
- EL2020_A.txt: Corresponds to the EL2020_A dataset mentioned in the article.
- EL2020_B.txt: Corresponds to the EL2020_B dataset mentioned in the article. Combining the data from these two files constitutes the EL2020_C dataset mentioned in the article.
2. NE_train.zip: Contains all neoepitope data used during the fine-tuning process for the immunogenicity prediction task.
- NE2023.txt: Corresponds to the NE2023 dataset mentioned in the article.
- NE2023_list.json: A candidate pool constructed from all wild-type sequences in NE2023.
- IN2023: Corresponds to the IN2023 dataset mentioned in the article, used as a validation set during the fine-tuning process.
------------------------------------------------------
Test Data:
3. EL_test.zip: Contains the test set for ligand presentation prediction, along with the prediction scores of NeoMHCI and other comparison methods.
- IM2020.csv: Corresponds to the IM2020 test set mentioned in the article.
- IS2020.csv: Corresponds to the IS2020 test set mentioned in the article.
4. NE_test.zip: Contains the neoepitope test set for immunogenicity prediction, along with the prediction scores of NeoMHCI and other comparison methods.
- BM2023.csv: Corresponds to the BM2023 test set mentioned in the article.
- PM2018_data.txt: Corresponds to the PM2018 test set mentioned in the article. It includes `mutation_id` for mutation number, `patient_id` for patient number, `epitope` indicating whether the mutation is immunogenic, `tpm` for the gene expression level of the mutation, `cell_line` for the renamed multi-allele combination of the patient, with the specific correspondence in PM2018_allelelist. `pepseq` represents the specific sequence of the mutation. Each mutation is represented by all 8-11mer slices containing the mutation site, with the highest prediction value among all slices representing the prediction score for that mutation.
- PM2018_records.csv: Records the prediction scores of each method for every mutation with TPM>0.
- PM2018_allelelist: Records the multi-allele combinations expressed by each patient in PM2018.
------------------------------------------------------
Common:
- allelelist: Records the specific MHC-I molecule combinations corresponding to the names of the multi-allele combinations (cell line) used in the MA data.
- MHC_pseudo.dat: Records the 34-mer pseudo sequences of MHC-I molecules.
- eval.py: Evaluation script used to compile various metrics from the records of each test set.
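The PM2018 scoring rule described above (every 8-11mer slice containing the mutation site, with the highest slice score taken as the mutation score) can be sketched as follows. This is a minimal illustration under stated assumptions, not the NeoMHCI implementation: the function names `slices_containing` and `mutation_score`, the 0-based `mut_pos` argument, and the caller-supplied `score_fn` are all hypothetical, and the actual layout of `pepseq` in PM2018_data.txt may differ.

```python
from typing import Callable, List


def slices_containing(seq: str, mut_pos: int,
                      min_len: int = 8, max_len: int = 11) -> List[str]:
    """Return every window of length min_len..max_len covering index mut_pos (0-based)."""
    if not 0 <= mut_pos < len(seq):
        raise ValueError("mut_pos must be a valid index into seq")
    windows = []
    for k in range(min_len, max_len + 1):
        # A window starting at `start` covers mut_pos when start <= mut_pos < start + k,
        # and it must also fit inside the sequence (start + k <= len(seq)).
        first = max(0, mut_pos - k + 1)
        last = min(mut_pos, len(seq) - k)
        for start in range(first, last + 1):
            windows.append(seq[start:start + k])
    return windows


def mutation_score(seq: str, mut_pos: int,
                   score_fn: Callable[[str], float]) -> float:
    """Score a mutation as the maximum score over all slices that contain it."""
    windows = slices_containing(seq, mut_pos)
    if not windows:
        raise ValueError("sequence is too short to yield any 8-11mer slice")
    return max(score_fn(w) for w in windows)
```

Here `score_fn` stands in for any per-peptide predictor; taking the maximum over slices means a mutation is ranked by its single best-scoring candidate peptide.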
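As noted for EL_train.zip, the EL2020_C dataset is the combination of EL2020_A.txt and EL2020_B.txt. A minimal sketch of that merge is given below, assuming both files are plain text with one record per line and no header row; that layout is an assumption and should be checked against the files before use. The helper `combine_datasets` and the output name `EL2020_C.txt` are hypothetical and not part of the release.

```python
from typing import Iterable


def combine_datasets(paths: Iterable[str], out_path: str) -> int:
    """Concatenate record lines from `paths` into `out_path`.

    Assumes one record per line and no header row (an assumption about the
    file layout). Returns the number of records written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    record = line.rstrip("\r\n")
                    if record.strip():  # skip blank lines
                        out.write(record + "\n")
                        count += 1
    return count
```

A call such as `combine_datasets(["EL2020_A.txt", "EL2020_B.txt"], "EL2020_C.txt")` would then write the merged file. The sketch only concatenates; it does not deduplicate, since the description does not say whether the two files overlap.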
Created:
2024-06-13



