Correcting batch effects in large-scale multiomics studies using a reference-material-based ratio method
DataCite Commons | Updated 2023-02-28 | Indexed 2024-08-18
Download link:
https://figshare.com/articles/dataset/Correcting_batch_effects_in_large-scale_multiomics_studies_using_a_reference-material-based_ratio_method/22188349
Description:
As part of the Quartet Project for quality control and data integration of multiomics profiling, we comprehensively assessed the performance of seven batch-effect correction algorithms (BECAs) using a range of clinically relevant performance metrics. Fifteen batches of transcriptomics, proteomics, and metabolomics data, generated on different platforms, in different labs, and with differing data quality, were used and are referred to as the full datasets in this study. In the full datasets, each batch comprised 12 libraries from 12 tubes, each tube representing one of the triplicates of one of four donors (D5, D6, F7, and M8). Therefore, 180 libraries (12 libraries per batch × 15 batches) were included in the full datasets at each omics level. We then drew subsets of the full datasets to create balanced and confounded scenarios for assessing the pros and cons of the BECAs. Here, we arbitrarily selected D6 as the common reference material, leaving the remaining three donors (D5, F7, and M8) as the study groups. In the balanced scenario, one replicate was selected for each study group from each of the 15 batches; this was done independently for each omics type. In the confounded scenario, for each omics type, 5 batches were randomly assigned to each study group (D5, F7, or M8), and all three replicates of the assigned study group were taken from those batches. For both scenarios, all three replicates of the reference sample (D6) in each batch were retained for reference-sample-based BECAs. Therefore, 45 study samples and 45 reference samples were used in each of the balanced and confounded scenarios at each omics level. This design kept the number of libraries the same in the balanced and confounded scenarios and kept the study samples separate from the reference samples, allowing an objective evaluation of the impact of the BECAs. The data analysis methods used in the study were as follows.
(1) For transcriptomics, RNA-seq reads were aligned using HISAT2, and genes were quantified using StringTie followed by Ballgown. Normalized expression values were obtained in Fragments Per Kilobase of transcript per Million mapped reads (FPKM). A floor value of 0.01 was added to the FPKM value of each gene, and a log2 transformation was then applied.
(2) For proteomics, MS raw files were searched against the human RefSeq protein database using Firmiana 1.0 with Mascot 2.3 (Matrix Science Inc.). The false discovery rate (FDR), estimated with a target-decoy strategy, was set to 1% for both proteins and peptides. Proteins were then quantified using the label-free iBAQ approach. The fraction of total (FOT) was used to represent the normalized abundance of a protein, defined as the protein's iBAQ value divided by the total iBAQ of all proteins identified in the same sample. A floor value of 0.01 was then added to the value of each protein, and a log2 transformation was applied.
(3) For metabolomics, raw data were extracted, peak-identified, and QC-processed using the in-house methods of each lab. Compounds were identified against an in-house library based on the retention time/index (RI), mass-to-charge ratio (m/z), and MS spectral data of each metabolite. Metabolites were quantified using either the area under the curve or the concentration calculated from a calibration curve built with standards of each metabolite. A floor value of 1 was then added to the value of each metabolite, and a log2 transformation was applied.
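The preprocessing steps above (FOT normalization, adding a floor value, log2 transformation) can be sketched in a few lines of Python, together with one plausible form of reference-based ratio scaling. This is a minimal sketch and not the authors' pipeline: the function names, the sample record layout, and the toy numbers are hypothetical. In particular, the description does not spell out the ratio formula, so the sketch assumes the ratio is taken on the log2 scale by subtracting, per feature, the mean of the D6 reference replicates measured in the same batch.

```python
import math
from collections import defaultdict


def floor_log2(value, floor):
    """Add a floor value, then take log2.

    Per the description: floor = 0.01 for FPKM and FOT values,
    floor = 1 for metabolite values.
    """
    return math.log2(value + floor)


def fraction_of_total(ibaq):
    """FOT: each protein's iBAQ divided by the total iBAQ of the sample."""
    total = sum(ibaq.values())
    return {protein: v / total for protein, v in ibaq.items()}


def ratio_correct(samples, reference="D6"):
    """Assumed ratio scaling on the log2 scale (hypothetical helper).

    samples: list of dicts with keys 'batch', 'donor', and 'log2'
             (a dict mapping feature name -> log2 value).
    Returns the study samples only, with the per-batch, per-feature mean
    of the reference replicates subtracted from each value.
    """
    # Collect reference values per batch and per feature.
    ref_values = defaultdict(lambda: defaultdict(list))
    for s in samples:
        if s["donor"] == reference:
            for feature, v in s["log2"].items():
                ref_values[s["batch"]][feature].append(v)

    corrected = []
    for s in samples:
        if s["donor"] == reference:
            continue  # reference samples are kept apart from study samples
        ref = ref_values[s["batch"]]
        ratios = {
            feature: v - sum(ref[feature]) / len(ref[feature])
            for feature, v in s["log2"].items()
            if feature in ref  # skip features with no reference measurement
        }
        corrected.append(
            {"batch": s["batch"], "donor": s["donor"], "log2_ratio": ratios}
        )
    return corrected


# Toy metabolite values (made up): batch B2 reads about twice as high as B1.
toy = [
    {"batch": "B1", "donor": "D6", "log2": {"m1": floor_log2(3.0, 1)}},
    {"batch": "B1", "donor": "D5", "log2": {"m1": floor_log2(7.0, 1)}},
    {"batch": "B2", "donor": "D6", "log2": {"m1": floor_log2(7.0, 1)}},
    {"batch": "B2", "donor": "D5", "log2": {"m1": floor_log2(15.0, 1)}},
]
corrected = ratio_correct(toy)
for row in corrected:
    print(row["batch"], row["donor"], row["log2_ratio"])
```

In the toy data, D5 differs between the two batches by one log2 unit before scaling, yet its log2 ratio to D6 is 1.0 in both batches, which is the behavior a reference-based ratio is meant to provide. Because only D6 is used for scaling, the study samples never influence their own correction, in line with the separation of study and reference samples described above.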
Provider:
figshare
Created:
2023-02-28



