Expression-based machine learning models for predicting plant tissue identity
Source: DataONE. Updated 2024-09-25. Indexed 2025-08-23.
Download link:
https://search.dataone.org/view/sha256:e1a2dd3fbdd779f496283d1365e7ff0efa63ed09d055449d26ec29de0b5bb51e
Resource description:
The selection of Arabidopsis as a model organism played a pivotal role in advancing genomic science. Competing proposals to adopt an agricultural or ecological model species were passed over in favor of building knowledge in a species that would facilitate genome-enabled research. Here, we examine the ability of models based on Arabidopsis gene expression data to predict tissue identity in other flowering plants. Across the machine learning algorithms compared, models trained and tested on Arabidopsis data achieved near-perfect precision and recall values, whereas when tissue identity is predicted across the flowering plants using models trained on Arabidopsis data, precision values range from 0.69 to 0.74 and recall from 0.54 to 0.64. Below-ground tissue is more predictable than other tissue types, and the ability to predict tissue identity is not correlated with phylogenetic distance from Arabidopsis. K-Nearest Neighbors is the most successful algorithm and suggests that...

We analyzed gene expression data from two sources. The first (Zhang et al., 2020) contains 28,165 Arabidopsis gene expression profiles across 37,334 genes. The second (Palande et al., 2023) contains 2,671 flowering plant gene expression profiles across 6,327 orthogroups.
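The precision and recall figures quoted in the abstract can be read as per-label ratios of correct predictions. The sketch below shows one way to compute them for a multi-class tissue labelling. The function name `per_label_precision_recall` and the toy labels are invented for illustration, and how the authors averaged scores across labels is not stated here, so no averaging is attempted.

```python
from collections import Counter


def per_label_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = Counter()  # label predicted and correct
    fp = Counter()  # label predicted, but the sample belongs to another label
    fn = Counter()  # label was the truth, but something else was predicted
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]  # times the label was predicted
        actual = tp[label] + fn[label]     # times the label was the truth
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy labels, invented for illustration only (not taken from the dataset).
y_true = ["below ground", "below ground", "aboveground",
          "aboveground", "whole plant", "other"]
y_pred = ["below ground", "below ground", "aboveground",
          "whole plant", "whole plant", "aboveground"]

scores = per_label_precision_recall(y_true, y_pred)
for label, (precision, recall) in scores.items():
    print(f"{label}: precision={precision:.2f}, recall={recall:.2f}")
```

A label that is never predicted gets a precision of 0.0 here by convention; other conventions (for example, reporting it as undefined) are equally defensible.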
Originally gene expression profiles were classified into 23 tissue types based on their original designations: "anther," "carpel," "cotyledon," "flower," "hypocotyl," "inflorescence," "internode," "leaf," "other," "petal," "petiole," "pistil," "reproductive-other," "root," "root cell," "seed," "seedling," "sepal," "shoot," "stamen," "stigma," "vasculature," or "whole plant."
Due to large differences in sample size between these categories, they were aggregated into four tissue type labels: "aboveground", "below ground", "whole plant", and "other". The categories are purposefully encompassing and were chosen to facilitate accurate assignment across the broad categories of experimental data we analyzed, focusing on aboveg...

# Expression-based machine learning models for predicting plant tissue identity
*Arabidopsis* Gene Expression Dataset
[https://doi.org/10.5061/dryad.4b8gthtn7](https://doi.org/10.5061/dryad.4b8gthtn7)
The dataset contains three `.parquet` files:
1\) `gene_FPKM_200501.parquet`: The original gene expression database was downloaded from the [Arabidopsis RNA-Seq Database](https://plantrnadb.com/athrdb/) ([Zhang et al., 2020](https://doi.org/10.1016/j.molp.2020.08.001)). The original dataset contains 28,165 Arabidopsis gene expression profiles across 37,334 genes.
2\) `gene_FPKM_transposed.parquet`: The transposed version of `gene_FPKM_200501.parquet`, with samples in rows, which matches the layout typical of machine learning datasets.
3\) `gene_FPKM_transposed_UMR75.parquet`: The gene expression profiles (`gene_FPKM_transposed.parquet`) were filtered to remove samples with a unique mapped rate below 75%. This dataset is used to train and test machine learning model...
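The relationship between the three files can be sketched with pandas on a toy table. This is a minimal illustration, not the authors' pipeline: the helper names, the toy values, and the idea that the unique mapped rate arrives as a separate per-sample series are all assumptions, since the listing does not say where that rate is stored.

```python
import pandas as pd


def transpose_expression(genes_by_samples: pd.DataFrame) -> pd.DataFrame:
    """Turn a genes-by-samples table into a samples-by-genes table."""
    return genes_by_samples.T


def filter_by_unique_mapped_rate(
    samples_by_genes: pd.DataFrame,
    unique_mapped_rate: pd.Series,
    threshold: float = 75.0,
) -> pd.DataFrame:
    """Keep samples whose unique mapped rate (in percent) is at least `threshold`.

    Samples with no recorded rate are dropped, because NaN >= threshold is False.
    """
    rates = unique_mapped_rate.reindex(samples_by_genes.index)
    return samples_by_genes.loc[rates >= threshold]


# Toy FPKM values, invented for illustration. The real table could be loaded
# with pd.read_parquet("gene_FPKM_200501.parquet"), which needs a parquet
# engine such as pyarrow to be installed.
toy = pd.DataFrame(
    {"sample_A": [1.0, 0.0], "sample_B": [2.5, 3.0], "sample_C": [0.0, 4.2]},
    index=["gene_1", "gene_2"],
)
# Hypothetical per-sample unique mapped rates, in percent.
toy_rates = pd.Series({"sample_A": 91.0, "sample_B": 60.0, "sample_C": 75.0})

transposed = transpose_expression(toy)  # rows are now samples
filtered = filter_by_unique_mapped_rate(transposed, toy_rates)
print(filtered)
```

Only samples strictly below the threshold are removed, so a sample at exactly 75% is kept, matching the description of the `UMR75` file.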
Created:
2025-08-05



