A Fisher’s Exact Test Justification of the TF–IDF Term-Weighting Scheme

Source: Figshare · Updated: 2025-07-29 · Indexed: 2026-04-28
Download link:
https://figshare.com/articles/dataset/A_Fisher_s_exact_test_justification_of_the_TF_IDF_term-weighting_scheme/29666157
Description:
Term frequency–inverse document frequency, or TF–IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term’s occurrences are concentrated in any one given document out of many, TF–IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF–IDF on a sound theoretical foundation. Building on that tradition, this article justifies the use of TF–IDF to the statistics community by demonstrating how the famed expression can be understood from a significance testing perspective. We show that the common TF–IDF variant TF–ICF is, under mild regularity conditions, closely related to the negative logarithm of the p-value from a one-tailed version of Fisher’s exact test of statistical significance. As a corollary, we establish a connection between TF–IDF and the said negative log-transformed p-value under certain idealized assumptions. We further demonstrate, as a limiting case, that this same quantity converges to TF–IDF in the limit of an infinitely large document collection. The Fisher’s exact test justification of TF–IDF equips the working statistician with a ready explanation of the term-weighting scheme’s long-established effectiveness. Supplementary materials for this article are available online.
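To make the quantities named in the description concrete, here is a minimal Python sketch that computes a TF–ICF score and the negative log p-value of a one-tailed Fisher's exact test for a single term and document. It rests on assumptions that are not taken from the article: TF–ICF is taken to be the term's count in the document times log(N / K), where N is the number of tokens in the collection and K is the term's count in the collection; the test is computed as the upper tail of a hypergeometric distribution; and the counts, along with the function names `fisher_upper_tail` and `tf_icf`, are invented for illustration.

```python
import math


def fisher_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """One-tailed Fisher's exact test p-value, P(X >= k).

    X is hypergeometric: the n tokens of the document are treated as a
    draw without replacement from the N tokens of the collection, K of
    which are occurrences of the term.
    """
    total = math.comb(N, n)
    tail = sum(
        math.comb(K, x) * math.comb(N - K, n - x)
        for x in range(k, min(n, K) + 1)
    )
    return tail / total


def tf_icf(k: int, K: int, N: int) -> float:
    """Assumed TF-ICF form: term count in the document times log(N / K)."""
    return k * math.log(N / K)


# Hypothetical toy counts, invented for illustration only.
N = 10_000  # tokens in the whole collection
K = 20      # occurrences of the term in the collection
n = 100     # tokens in the document of interest
k = 5       # occurrences of the term in that document

p_value = fisher_upper_tail(k, n, K, N)
print("TF-ICF       :", tf_icf(k, K, N))
print("-log(p-value):", -math.log(p_value))
```

The two printed numbers are not expected to coincide in a toy setting like this one. The description states only that the quantities are closely related under the article's regularity conditions, so the sketch is a way to experiment with the two definitions rather than a reproduction of the article's result.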
Created:
2025-07-29