
Data used to produce Fig 2B.

Figshare; updated 2023-03-02; indexed 2026-04-28
Download link:
https://figshare.com/articles/dataset/Data_used_to_produce_Fig_2B_/22206652
Description:
We assess inferential quality in the field of differential expression profiling by high-throughput sequencing (HT-seq) based on analysis of datasets submitted from 2008 to 2020 to the NCBI GEO data repository. We take advantage of the parallel differential expression testing over thousands of genes, whereby each experiment leads to a large set of p-values, the distribution of which can indicate the validity of assumptions behind the test. From a well-behaved p-value set, π0, the fraction of genes that are not differentially expressed, can be estimated. We found that only 25% of experiments resulted in theoretically expected p-value histogram shapes, although there is a marked improvement over time. Uniform p-value histogram shapes, indicative of π0-s of less than 0.5, as if most genes changed their expression level. Most HT-seq experiments have very small sample sizes and are expected to be underpowered. Nevertheless, the estimated π0-s do not have the expected association with N, suggesting widespread problems with controlling the false discovery rate (FDR) in these experiments. Both the fractions of different p-value histogram types and the π0 values are strongly associated with the differential expression analysis program used by the original authors. While we could double the proportion of theoretically expected p-value distributions by removing low-count features from the analysis, this treatment did not remove the association with the analysis program. Taken together, our results indicate widespread bias in the differential expression profiling field and the unreliability of statistical methods used to analyze HT-seq data.
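
The description states that π0 can be estimated from a well-behaved p-value set but does not say which estimator was used. As an illustration only, the sketch below uses a Storey-type estimator with a fixed threshold, which is an assumption and not necessarily the authors' method. The function name `estimate_pi0`, the parameter `lam`, and the simulated data are all hypothetical.

```python
import numpy as np


def estimate_pi0(p_values, lam=0.5):
    """Storey-type estimate of pi0 with a fixed threshold lam (illustrative)."""
    p = np.asarray(p_values, dtype=float)
    if p.size == 0:
        raise ValueError("p_values must not be empty")
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must be in [0, 1)")
    # Null p-values are uniform on [0, 1], so about (1 - lam) * pi0 of all
    # p-values should fall above lam if true effects rarely land there.
    pi0 = np.mean(p > lam) / (1.0 - lam)
    # The raw ratio can exceed 1 by chance, so cap it at 1.
    return float(min(pi0, 1.0))


if __name__ == "__main__":
    # Simulated (not real) data: 8000 null genes plus 2000 genes whose
    # p-values are concentrated near zero.
    rng = np.random.default_rng(0)
    nulls = rng.uniform(0.0, 1.0, size=8000)
    effects = rng.beta(0.5, 20.0, size=2000)
    print(estimate_pi0(np.concatenate([nulls, effects])))
```

The estimate is only meaningful when the right-hand part of the p-value histogram is roughly flat, which is the "well-behaved" condition the description refers to.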

Created:
2023-03-02