
NeoMHCI Training and Evaluation Data

Source: Mendeley Data. Updated 2024-06-25; indexed 2024-06-26.
Download link:
https://data.mendeley.com/datasets/kmt8tx7gh6
Description:
All training and evaluation data used in the NeoMHCI study. All code is freely available at https://github.com/ZhuLab-Fudan/NeoMHCI.

------------------------------------------------------
Train Data:

1. EL_train.zip: Contains all eluted ligand data used for training the five-fold cross-validation model.
   - EL2020_A.txt: Corresponds to the EL2020_A dataset mentioned in the article.
   - EL2020_B.txt: Corresponds to the EL2020_B dataset mentioned in the article.
   Combining the data from these two files constitutes the EL2020_C dataset mentioned in the article.

2. NE_train.zip: Contains all neoepitope data used during the fine-tuning process for the immunogenicity prediction task.
   - NE2023.txt: Corresponds to the NE2023 dataset mentioned in the article.
   - NE2023_list.json: A candidate pool constructed from all wild-type sequences in NE2023.
   - IN2023: Corresponds to the IN2023 dataset mentioned in the article, used as a validation set during the fine-tuning process.

------------------------------------------------------
Test Data:

3. EL_test.zip: Contains the test set for ligand presentation prediction, along with the prediction scores of NeoMHCI and other comparison methods.
   - IM2020.csv: Corresponds to the IM2020 test set mentioned in the article.
   - IS2020.csv: Corresponds to the IS2020 test set mentioned in the article.

4. NE_test.zip: Contains the neoepitope test set for immunogenicity prediction, along with the prediction scores of NeoMHCI and other comparison methods.
   - BM2023.csv: Corresponds to the BM2023 test set mentioned in the article.
   - PM2018_data.txt: Corresponds to the PM2018 test set mentioned in the article. It includes the following fields:
     - `mutation_id`: the mutation number.
     - `patient_id`: the patient number.
     - `epitope`: indicates whether the mutation is immunogenic.
     - `tpm`: the gene expression level of the mutation.
     - `cell_line`: the renamed multi-allele combination of the patient, with the specific correspondence in PM2018_allelelist.
     - `pepseq`: the specific sequence of the mutation.
     Each mutation is represented by all 8-11mer slices containing the mutation site, with the highest prediction value among all slices representing the prediction score for that mutation.
   - PM2018_records.csv: Records the prediction scores of each method for every mutation with TPM>0.
   - PM2018_allelelist: Records the multi-allele combinations expressed by each patient in PM2018.

------------------------------------------------------
Common:

   - allelelist: Records the specific MHC-I molecule combinations corresponding to the names of the multi-allele combinations (cell line) used in the MA data.
   - MHC_pseudo.dat: Records the 34-mer pseudo sequences of MHC-I molecules.
   - eval.py: Evaluation script used to compile various metrics from the records of each test set.
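The PM2018 scoring rule described above (enumerate every 8-11mer that covers the mutation site, then keep the highest prediction) can be illustrated with a minimal Python sketch. This is not code from the NeoMHCI repository or from eval.py. The function names `mutation_windows` and `mutation_score`, the 0-based `mut_pos` argument, and the `predict` callable are all hypothetical, and the sketch assumes the mutated sequence is available as a plain string with a known mutation index, which the description does not specify.

```python
from typing import Callable, List


def mutation_windows(seq: str, mut_pos: int,
                     min_len: int = 8, max_len: int = 11) -> List[str]:
    """Return every peptide of length min_len..max_len covering seq[mut_pos].

    Assumption: mut_pos is a 0-based index into seq (hypothetical convention).
    """
    windows = []
    for length in range(min_len, max_len + 1):
        # Earliest start that still covers mut_pos, clipped to the sequence start.
        first = max(0, mut_pos - length + 1)
        # Latest start that covers mut_pos and keeps the window inside seq.
        last = min(mut_pos, len(seq) - length)
        for start in range(first, last + 1):
            windows.append(seq[start:start + length])
    return windows


def mutation_score(seq: str, mut_pos: int,
                   predict: Callable[[str], float]) -> float:
    """Score a mutation as the maximum prediction over all covering windows.

    `predict` stands in for any per-peptide predictor (hypothetical).
    """
    windows = mutation_windows(seq, mut_pos)
    if not windows:
        raise ValueError("sequence too short to contain an 8-11mer window")
    return max(predict(peptide) for peptide in windows)
```

With a 21-residue sequence and the mutation at index 10, `mutation_windows` yields 8 + 9 + 10 + 11 = 38 peptides. Taking the maximum rather than the mean means a mutation is scored by its single best-scoring covering peptide.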

Created:
2024-06-13