Expression-based machine learning models for predicting plant tissue identity
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.4b8gthtn7
Description:
The selection of Arabidopsis as a model organism played a pivotal role in advancing genomic science. Competing proposals to adopt an agriculturally or ecologically important model species were passed over in favor of building knowledge in a species that would facilitate genome-enabled research. Here, we examine the ability of models based on Arabidopsis gene expression data to predict tissue identity in other flowering plants. Across the machine learning algorithms we compared, models trained and tested on Arabidopsis data achieved near-perfect precision and recall, whereas models trained on Arabidopsis data and used to predict tissue identity across the flowering plants reached precision values of 0.69 to 0.74 and recall values of 0.54 to 0.64. Below-ground tissue is more predictable than other tissue types, and the ability to predict tissue identity is not correlated with phylogenetic distance from Arabidopsis. K-Nearest Neighbors is the most successful algorithm, which suggests that gene expression signatures, rather than marker genes, are more valuable for building models of tissue and cell type prediction in plants. Our data-driven results show that the assertion that knowledge from Arabidopsis is translatable to other plants is not always true. Considering the current landscape of abundant sequencing data, we should reevaluate the scientific emphasis on Arabidopsis and prioritize plant diversity.
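As a rough illustration of the evaluation described above, the following minimal sketch labels expression profiles by a k-nearest-neighbors vote and computes precision and recall for one tissue label. The toy vectors, the Euclidean distance, k = 3, and the function names are assumptions made for illustration; none of them is taken from the study's pipeline.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training profiles."""
    # Distance is computed over the whole expression vector, so the full
    # signature is compared rather than a handful of marker genes.
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for a single label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy training profiles (assumed values, three features each).
train_x = [
    (9.0, 1.0, 0.5), (8.5, 1.2, 0.4), (8.8, 0.9, 0.6),
    (1.0, 7.5, 3.0), (1.2, 8.0, 2.8), (0.8, 7.8, 3.1),
]
train_y = ["aboveground"] * 3 + ["below ground"] * 3

# Toy held-out profiles standing in for samples from another species.
test_x = [(8.0, 2.0, 1.0), (1.5, 7.0, 2.5), (5.2, 4.0, 1.5)]
test_y = ["aboveground", "below ground", "below ground"]

predictions = [knn_predict(train_x, train_y, q) for q in test_x]
below_precision, below_recall = precision_recall(test_y, predictions, "below ground")
```

In this toy case the third held-out profile sits closer to the aboveground training profiles and is mislabeled, which lowers recall for "below ground" while leaving its precision intact.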
Methods
We analyzed gene expression data from two sources. The first (Zhang et al., 2020) contains 28,165 Arabidopsis gene expression profiles across 37,334 genes. The second (Palande et al., 2023) contains 2,671 flowering plant gene expression profiles across 6,327 orthogroups.
Gene expression profiles were first classified into 23 tissue types based on their original designations: “anther,” “carpel,” “cotyledon,” “flower,” “hypocotyl,” “inflorescence,” “internode,” “leaf,” “other,” “petal,” “petiole,” “pistil,” “reproductive-other,” “root,” “root cell,” “seed,” “seedling,” “sepal,” “shoot,” “stamen,” “stigma,” “vasculature,” or “whole plant.”
Due to large differences in sample size between these categories, they were aggregated into four tissue type labels: "aboveground", "below ground", "whole plant", and "other". The categories are purposefully encompassing and were chosen to facilitate accurate assignment across the broad categories of experimental data we analyzed, focusing on aboveground and belowground tissue identity as one of the simplest cases to test tissue predictability.
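The aggregation step can be sketched as a lookup from original designation to broad label. The text names the 23 designations and the four labels but does not say which designation falls under which label, so the partial mapping below is purely hypothetical, as are the function name and the sample list.

```python
from collections import Counter

# Hypothetical, partial mapping. Every assignment here is an assumption made
# for illustration; the source does not list the actual assignments.
BROAD_LABEL = {
    "root": "below ground",
    "root cell": "below ground",
    "leaf": "aboveground",
    "petal": "aboveground",
    "stamen": "aboveground",
    "whole plant": "whole plant",
    "other": "other",
}


def aggregate_label(designation):
    """Return the broad label for a designation, or None if it is unmapped."""
    return BROAD_LABEL.get(designation.strip().lower())


# Toy designations; "callus" stands in for a description that is not mapped.
designations = ["Root", "leaf", "root cell", "petal", "whole plant", "callus"]
labels = [aggregate_label(d) for d in designations]

# Count samples per broad label, ignoring unmapped designations.
counts = Counter(label for label in labels if label is not None)
```

Returning None for an unmapped designation, instead of guessing a label, keeps such samples visible so they can be reviewed or discarded explicitly.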
Samples for which tissue identity could not be determined from their description were discarded, as they were incompatible with our machine learning pipeline. Additionally, we discarded low-quality samples, as measured by unique mapped rate (the number of uniquely mapping reads divided by the total number of reads). After removing samples with missing metadata and samples with a low unique mapped rate (<75%), the Arabidopsis database was left with 19,415 samples. A conserved Arabidopsis database was also constructed by keeping only the genes mapped to the orthogroups from the flowering plant database. The conserved Arabidopsis database contained the same number of samples but with much smaller expression profiles, spanning only the 6,327 orthogroups shared with the flowering plant database.
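A minimal sketch of these two filtering steps follows. The 75% cutoff and the definition of unique mapped rate come from the text; the record layout, field names, gene identifiers, and orthogroup identifiers are hypothetical stand-ins.

```python
MIN_UNIQUE_MAPPED_RATE = 0.75  # cutoff stated in the text


def unique_mapped_rate(sample):
    """Uniquely mapping reads divided by the total number of reads."""
    return sample["uniquely_mapped_reads"] / sample["total_reads"]


def keep_sample(sample):
    """Keep samples that have a tissue label and pass the mapping-rate cutoff."""
    if sample.get("tissue") is None or not sample.get("total_reads"):
        return False
    return unique_mapped_rate(sample) >= MIN_UNIQUE_MAPPED_RATE


def conserved_profile(expression, gene_to_orthogroup):
    """Keep only the genes that map to an orthogroup in the flowering plant data."""
    return {g: v for g, v in expression.items() if g in gene_to_orthogroup}


# Toy records (assumed layout and values).
samples = [
    {"id": "s1", "tissue": "root", "uniquely_mapped_reads": 9_000_000,
     "total_reads": 10_000_000,
     "expression": {"gene_a": 5.0, "gene_b": 0.0, "gene_c": 2.5}},
    {"id": "s2", "tissue": "leaf", "uniquely_mapped_reads": 6_000_000,
     "total_reads": 10_000_000,
     "expression": {"gene_a": 1.0, "gene_b": 3.0, "gene_c": 0.5}},
    {"id": "s3", "tissue": None, "uniquely_mapped_reads": 9_500_000,
     "total_reads": 10_000_000,
     "expression": {"gene_a": 2.0, "gene_b": 1.0, "gene_c": 4.0}},
]

# Hypothetical gene-to-orthogroup table; gene_b has no orthogroup.
gene_to_orthogroup = {"gene_a": "OG1", "gene_c": "OG2"}

kept = [s for s in samples if keep_sample(s)]
conserved = [conserved_profile(s["expression"], gene_to_orthogroup) for s in kept]
```

Here s2 is dropped for a unique mapped rate of 0.60 and s3 for missing tissue metadata. The conserved profile has the same samples as the filtered set but fewer features, mirroring the relationship between the full and conserved Arabidopsis databases.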
References:
Zhang, H., F. Zhang, Y. Yu, L. I. Feng, J. Jia, B. O. Liu, B. Li, et al. 2020. A comprehensive online database for exploring ∼20,000 public Arabidopsis RNA-seq libraries. Molecular Plant 13(9): 1231–1233.
Palande, S., J. A. Kaste, M. D. Roberts, K. S. Aba, C. Claucherty, J. Dacon, R. Doko, et al. 2023. The topological shape of gene expression across the evolution of flowering plants. PLoS Biology 21(12): e3002397.
Created:
2024-09-25



