A Fisher’s Exact Test Justification of the TF–IDF Term-Weighting Scheme
Source: NIAID Data Ecosystem. Indexed: 2026-05-02.
Download link:
https://figshare.com/articles/dataset/A_Fisher_s_exact_test_justification_of_the_TF_IDF_term-weighting_scheme/29666157
Description:
Term frequency–inverse document frequency, or TF–IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term’s occurrences are concentrated in any one given document out of many, TF–IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF–IDF on a sound theoretical foundation. Building on that tradition, this article justifies the use of TF–IDF to the statistics community by demonstrating how the famed expression can be understood from a significance testing perspective. We show that the common TF–IDF variant TF–ICF is, under mild regularity conditions, closely related to the negative logarithm of the p-value from a one-tailed version of Fisher’s exact test of statistical significance. As a corollary, we establish a connection between TF–IDF and the said negative log-transformed p-value under certain idealized assumptions. We further demonstrate, as a limiting case, that this same quantity converges to TF–IDF in the limit of an infinitely large document collection. The Fisher’s exact test justification of TF–IDF equips the working statistician with a ready explanation of the term-weighting scheme’s long-established effectiveness. Supplementary materials for this article are available online.
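To make the quantities in the description concrete, here is a minimal Python sketch that computes a one-tailed Fisher's exact test p-value for a term's concentration in one document, alongside a TF–ICF score. It is an illustration under stated assumptions, not the article's own derivation: the 2x2 table is built from token counts (term vs. other tokens, target document vs. rest of the collection), TF–ICF is taken to be the raw term count times log(N / K), and the toy corpus `DOCS` and all function names are hypothetical.

```python
from math import comb, log

# Hypothetical toy collection: each document is a list of tokens.
DOCS = [
    "fisher exact test fisher test".split(),
    "term weighting scheme for retrieval".split(),
    "document retrieval with term weighting".split(),
    "exact match retrieval".split(),
]


def term_counts(docs, term, doc_index):
    """Return (k, n, K, N) for a term and a target document.

    k: occurrences of the term in the target document
    n: number of tokens in the target document
    K: occurrences of the term in the whole collection
    N: number of tokens in the whole collection
    """
    k = docs[doc_index].count(term)
    n = len(docs[doc_index])
    K = sum(d.count(term) for d in docs)
    N = sum(len(d) for d in docs)
    return k, n, K, N


def fisher_one_tailed_p(k, n, K, N):
    """One-tailed Fisher's exact test p-value, P(X >= k).

    X follows a hypergeometric distribution: n tokens drawn without
    replacement from N tokens, of which K are the term of interest.
    """
    total = comb(N, n)
    tail = sum(
        comb(K, i) * comb(N - K, n - i)  # comb returns 0 when n - i > N - K
        for i in range(k, min(n, K) + 1)
    )
    return tail / total


def tf_icf(k, K, N):
    """TF-ICF under the assumption stated above: k * log(N / K)."""
    return k * log(N / K)


for term, doc_index in [("fisher", 0), ("retrieval", 1)]:
    k, n, K, N = term_counts(DOCS, term, doc_index)
    p = fisher_one_tailed_p(k, n, K, N)
    print(f"{term!r} in doc {doc_index}: "
          f"-log(p) = {-log(p):.3f}, TF-ICF = {tf_icf(k, K, N):.3f}")
```

On a corpus this small the two scores should not be expected to agree numerically; the description states the relationship only under regularity conditions and in the limit of a large collection. The sketch simply shows which counts enter each quantity.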
Created:
2025-07-29



