Code + Data for COG Identification

Name: Code + Data for COG Identification
Creator: figshare
Published: 2025-11-14 00:06:34
License: No description available

Source: DataCite Commons (updated 2025-11-14; indexed 2026-04-25)

Download link:

https://figshare.com/articles/dataset/Code_Data_for_COG_Identification/30615452/1

Description:

To identify clusters of orthologous genes (COGs) that correlate with nutrient limitation in the modern ocean, we examined the Ocean Microbial Reference Gene Catalog v2 (OM-RGC.v2) from the Tara Oceans Project. The OM-RGC.v2 includes relative gene abundances of all COGs (n = 4,787) in 139 Tara Oceans metagenomic samples, along with metadata including phosphate, oxygen, and nitrate/nitrite concentrations. (Nitrate and nitrite values were reported together for OM-RGC.v2.) Iron concentrations for Tara Oceans samples were not available and were therefore estimated from PISCES2 model predictions for the Tara Oceans sampling locations, as described in Table S1 of Caputi et al., 2019. Iron concentrations were predicted for the surface and the deep chlorophyll maximum (DCM) only; iron concentrations for samples from the mesopelagic zone were not available under the PISCES2 model. All other metadata for Tara Oceans samples were obtained directly from Salazar et al., 2019.

Correlations between COGs and metadata were estimated using regression models. Compound Poisson linear models were fitted in bulk using the MaAsLin2 software package (v. 1.18.0). A separate model was fitted for each COG to analyze the effect of metadata variables on that COG's abundance. While the main focus was correlation with nutrient abundance, environmental metadata were included in the model to control for as many potential confounding effects as the data allowed. The following predictors were included in the final model (based on variables available from the Tara Oceans dataset): the size fraction at which the sample was taken, mean temperature, depth, salinity, mean oxygen concentration, PO4 concentration, NO2 + NO3 concentration, iron concentration, and absolute latitude.

Of these, depth, PO4 concentration, and NO2 + NO3 concentration were log-transformed to improve model fit. To the same end, the iron concentration was square-root transformed, and the absolute value of the latitude was taken. No other transformations or normalization were performed. No abundance cutoff was applied, but COGs present in fewer than one-third of the Tara Oceans samples were discarded to ensure that the COGs identified by the statistical model were meaningful.
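
The predictor transformations and the prevalence filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dataset's actual code: the field names (`depth`, `po4`, `no2_no3`, `iron`, `latitude`) are hypothetical, the natural logarithm is assumed because the description does not give a base, and "present" is taken to mean an abundance greater than zero. The description also does not say how zero concentrations were handled before the log transform, so the sketch simply rejects them. The model fitting itself (MaAsLin2) is not reproduced here.

```python
import math

# Hypothetical field names; the dataset description does not list the real ones.
LOG_FIELDS = ("depth", "po4", "no2_no3")


def transform_metadata(sample):
    """Return a copy of one sample's metadata with the described transforms applied."""
    out = dict(sample)
    for name in LOG_FIELDS:
        value = sample[name]
        if value <= 0:
            # Zero handling is not described in the source, so this sketch refuses it.
            raise ValueError(f"{name} must be positive to log-transform, got {value}")
        out[name] = math.log(value)  # natural log is an assumption
    out["iron"] = math.sqrt(sample["iron"])    # square-root transform
    out["latitude"] = abs(sample["latitude"])  # absolute latitude
    return out


def filter_by_prevalence(abundances):
    """Discard COGs present (abundance > 0) in fewer than one-third of samples.

    `abundances` maps a COG identifier to its per-sample relative abundances.
    Integer arithmetic (3 * present >= n) avoids floating-point edge cases.
    """
    kept = {}
    for cog, values in abundances.items():
        present = sum(1 for v in values if v > 0)
        if 3 * present >= len(values):
            kept[cog] = values
    return kept
```

With the 139 samples mentioned in the description, this rule would keep a COG only if it is detected in at least 47 samples. All other predictors (size fraction, temperature, salinity, oxygen) pass through unchanged, matching the statement that no other transformations were performed.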

Provider: figshare

Created: 2025-11-14
