ComBat HarmonizR enables the integrated analysis of independently generated proteomic datasets through data harmonization with appropriate handling of missing values

NIAID Data Ecosystem, indexed 2026-03-13

Download link:

https://www.omicsdi.org/dataset/pride/PXD027467

Description:

The integration of proteomic datasets generated by non-cooperating laboratories using different LC-MS/MS setups can overcome the limitations of statistically underpowered sample cohorts, but has not been demonstrated to date. In proteomics, differences in sample preservation and preparation strategies, in chromatography and mass spectrometry approaches, and in the quantification strategy used distort protein abundance distributions in integrated datasets. The removal of these technical batch effects requires setup-specific normalization and strategies that can deal with missing at random (MAR) and missing not at random (MNAR) values at the same time. Algorithms for batch effect removal that are commonly used for other omics types, such as the ComBat algorithm, disregard proteins with MNAR missing values and thereby significantly reduce the informational yield and the effect size for combined datasets. Here, we present a strategy for data harmonization across different tissue preservation techniques, LC-MS/MS instrumentation setups and quantification approaches. To enable batch effect removal without the need for data reduction or error-prone imputation, we developed an extension to the ComBat algorithm, ComBat HarmonizR, that performs data harmonization with appropriate handling of MAR and MNAR missing values by matrix dissection. The ComBat HarmonizR based strategy enables the combined analysis of independently generated proteomic datasets for the first time. Furthermore, we found ComBat HarmonizR to be superior to the commonly used internal reference scaling (iRS) for removing batch effects between different Tandem Mass Tag (TMT)-plexes. Because the matrix dissection approach does not require data imputation, the HarmonizR algorithm can be applied to any type of -omics data while ensuring minimal data loss.
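The matrix dissection idea in the description can be illustrated with a small sketch. This is not the HarmonizR implementation: the function names, the `min_values` threshold, and the choice to mask values from batches below that threshold are all assumptions made for illustration, and a simple per-batch mean shift stands in for ComBat. The sketch only shows the principle of grouping proteins by the set of batches in which they are sufficiently observed, then adjusting each group over its own batches so that no imputation is needed.

```python
import math
from collections import defaultdict


def dissect_by_batch_presence(matrix, batches, min_values=2):
    """Group row indices by the batches in which the row has enough observed values.

    `min_values` is an illustrative threshold, not a documented HarmonizR setting.
    """
    batch_ids = sorted(set(batches))
    groups = defaultdict(list)
    for i, row in enumerate(matrix):
        present = tuple(
            b for b in batch_ids
            if sum(1 for v, vb in zip(row, batches)
                   if vb == b and not math.isnan(v)) >= min_values
        )
        groups[present].append(i)
    return dict(groups)


def mean_shift(row, batches, present):
    """Stand-in for ComBat: move each usable batch onto the row's grand mean.

    Values from batches outside `present` are masked as NaN (an assumption).
    """
    usable = [v for v, b in zip(row, batches)
              if b in present and not math.isnan(v)]
    grand = sum(usable) / len(usable)
    out = [math.nan] * len(row)
    for b in present:
        vals = [v for v, vb in zip(row, batches)
                if vb == b and not math.isnan(v)]
        shift = grand - sum(vals) / len(vals)
        for j, (v, vb) in enumerate(zip(row, batches)):
            if vb == b and not math.isnan(v):
                out[j] = v + shift
    return out


def harmonize_sketch(matrix, batches, min_values=2):
    """Adjust every sub-matrix over its own batches; rows stay in input order."""
    result = [None] * len(matrix)
    groups = dissect_by_batch_presence(matrix, batches, min_values)
    for present, rows in groups.items():
        for i in rows:
            if len(present) >= 2:
                result[i] = mean_shift(matrix[i], batches, present)
            else:
                # Fewer than two usable batches: nothing to adjust between.
                result[i] = list(matrix[i])
    return result


nan = math.nan
batches = ["A", "A", "B", "B", "C", "C"]   # batch label of each sample column
matrix = [
    [1.0, 3.0, 11.0, 13.0, 21.0, 23.0],    # observed in A, B and C
    [1.0, 3.0, nan, nan, 5.0, 7.0],        # missing in all of batch B
    [nan, nan, 4.0, 6.0, nan, nan],        # observed only in batch B
]
groups = dissect_by_batch_presence(matrix, batches)
harmonized = harmonize_sketch(matrix, batches)
```

In this toy input the second protein is absent from batch B entirely, the kind of batch-wise missingness that would cause it to be dropped by a method requiring complete data. Here it is adjusted across batches A and C only and stays in the result.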

Created:

2022-05-23
