
NeoMHCI Training and Evaluation Data

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/kmt8tx7gh6
Description:
All training and evaluation data used in the NeoMHCI study. All code is freely available at https://github.com/ZhuLab-Fudan/NeoMHCI.

------------------------------------------------------
Train Data:

1. EL_train.zip: Contains all eluted ligand data used for training the five-fold cross-validation model.
   - EL2020_A.txt: Corresponds to the EL2020_A dataset mentioned in the article.
   - EL2020_B.txt: Corresponds to the EL2020_B dataset mentioned in the article.
   Combining the data from these two files constitutes the EL2020_C dataset mentioned in the article.

2. NE_train.zip: Contains all neoepitope data used during the fine-tuning process for the immunogenicity prediction task.
   - NE2023.txt: Corresponds to the NE2023 dataset mentioned in the article.
   - NE2023_list.json: A candidate pool constructed from all wild-type sequences in NE2023.
   - IN2023: Corresponds to the IN2023 dataset mentioned in the article, used as a validation set during the fine-tuning process.

------------------------------------------------------
Test Data:

3. EL_test.zip: Contains the test set for ligand presentation prediction, along with the prediction scores of NeoMHCI and other comparison methods.
   - IM2020.csv: Corresponds to the IM2020 test set mentioned in the article.
   - IS2020.csv: Corresponds to the IS2020 test set mentioned in the article.

4. NE_test.zip: Contains the neoepitope test set for immunogenicity prediction, along with the prediction scores of NeoMHCI and other comparison methods.
   - BM2023.csv: Corresponds to the BM2023 test set mentioned in the article.
   - COVID-19.txt: Corresponds to the COVID-19 test set mentioned in the article.
   - PM2018_data.txt: Corresponds to the PM2018 test set mentioned in the article. It includes the following fields:
     - `mutation_id`: the mutation number.
     - `patient_id`: the patient number.
     - `epitope`: indicates whether the mutation is immunogenic.
     - `tpm`: the gene expression level of the mutation.
     - `cell_line`: the renamed multi-allele combination of the patient, with the specific correspondence in PM2018_allelelist.
     - `pepseq`: the specific sequence of the mutation.
     Each mutation is represented by all 8-11mer slices containing the mutation site, with the highest prediction value among all slices representing the prediction score for that mutation.
   - PM2018_records.csv: Records the prediction scores of each method for every mutation with TPM>0.
   - PM2018_allelelist: Records the multi-allele combinations expressed by each patient in PM2018.
   - RV2023_data.txt, RV2023_records.txt, RV2023_allelelist: Same as the corresponding PM2018 files.

------------------------------------------------------
Common:

- allelelist: Records the specific MHC-I molecule combinations corresponding to the names of the multi-allele combinations (cell line) used in the MA data.
- MHC_pseudo.dat: Records the 34-mer pseudo sequences of MHC-I molecules.
- eval.py: Evaluation script used to compile various metrics from the records of each test set.
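
The per-mutation scoring rule described for PM2018_data.txt (every 8-11mer slice that contains the mutation site is scored, and the highest value is taken as the mutation's score) can be illustrated with a minimal sketch. This is not the authors' eval.py or prediction code. The function names, the `mut_pos` argument (a 0-based index of the mutated residue within `pepseq`), and the `predict` callable are all assumptions made for illustration; the description does not say how the mutation position is encoded in the data files.

```python
from typing import Callable, List


def mutation_windows(pepseq: str, mut_pos: int,
                     min_len: int = 8, max_len: int = 11) -> List[str]:
    """Return every substring of length min_len..max_len covering mut_pos.

    mut_pos is a hypothetical 0-based index of the mutated residue.
    """
    if not 0 <= mut_pos < len(pepseq):
        raise ValueError("mut_pos must be a valid index into pepseq")
    windows = []
    for k in range(min_len, max_len + 1):
        # A window starting at `start` covers mut_pos when
        # start <= mut_pos <= start + k - 1, and must fit inside pepseq.
        first = max(0, mut_pos - k + 1)
        last = min(mut_pos, len(pepseq) - k)
        for start in range(first, last + 1):
            windows.append(pepseq[start:start + k])
    return windows


def mutation_score(pepseq: str, mut_pos: int,
                   predict: Callable[[str], float]) -> float:
    """Score a mutation as the maximum prediction over all of its windows.

    `predict` is a placeholder for any peptide-level scoring function.
    """
    windows = mutation_windows(pepseq, mut_pos)
    if not windows:
        raise ValueError("pepseq is too short to yield an 8-11mer window")
    return max(predict(w) for w in windows)
```

For a 21-residue sequence with the mutation at the center (index 10), this enumerates 8 + 9 + 10 + 11 = 38 candidate peptides. Any model can be passed as `predict`, as long as it maps a peptide string to a single number.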

Created: 2024-10-14