Intentional Control of Type I Error Over Unconscious Data Distortion: A Neyman–Pearson Approach to Text Classification

Name: Intentional Control of Type I Error Over Unconscious Data Distortion: A Neyman–Pearson Approach to Text Classification
Creator: Taylor & Francis
Published: 2024-02-26 07:56:03
License: No description available

Source: DataCite Commons (updated 2024-02-26, indexed 2024-07-29)

Download link:

https://tandf.figshare.com/articles/dataset/Intentional_Control_of_Type_I_Error_over_Unconscious_Data_Distortion_a_Neyman-Pearson_Approach_to_Text_Classification/11962101/4

Description:

This article addresses the challenges in classifying textual data obtained from open online platforms, which are vulnerable to distortion. Most existing classification methods minimize the overall classification error and may yield an undesirably large Type I error (relevant textual messages are classified as irrelevant), particularly when available data exhibit an asymmetry between relevant and irrelevant information. Data distortion exacerbates this situation and often leads to fallacious prediction. To deal with inestimable data distortion, we propose the use of the Neyman–Pearson (NP) classification paradigm, which minimizes Type II error under a user-specified Type I error constraint. Theoretically, we show that the NP oracle is unaffected by data distortion when the class conditional distributions remain the same. Empirically, we study a case of classifying posts about worker strikes obtained from a leading Chinese microblogging platform, which are frequently prone to extensive, unpredictable and inestimable censorship. We demonstrate that, even though the training and test data are susceptible to different distortion and therefore potentially follow different distributions, our proposed NP methods control the Type I error on test data at the targeted level. The methods and implementation pipeline proposed in our case study are applicable to many other problems involving data distortion. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
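The description does not say how the NP classifiers choose their decision threshold, so the following is only a minimal sketch of one common approach from the NP classification literature: an order-statistic rule applied to held-out scores from the class whose misclassification counts as a Type I error. Several things here are assumptions and not taken from the article: the function names `violation_bound`, `np_threshold` and `classify`; the convention that class 0 is the relevant class and that higher scores point to class 1 (irrelevant); the tolerance `delta` on the probability of exceeding the target level `alpha`; and continuous, independently drawn scores.

```python
from math import comb


def violation_bound(k, n, alpha):
    """Bound on P(Type I error > alpha) when the threshold is the k-th
    smallest of n held-out class-0 scores (assumes continuous i.i.d. scores)."""
    return sum(
        comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
        for j in range(k, n + 1)
    )


def np_threshold(class0_scores, alpha=0.05, delta=0.05):
    """Pick the smallest order statistic whose violation bound is <= delta.

    A smaller threshold sends more items to class 1, which lowers Type II
    error, so the smallest admissible order statistic is preferred.
    """
    scores = sorted(class0_scores)
    n = len(scores)
    for k in range(1, n + 1):
        if violation_bound(k, n, alpha) <= delta:
            return scores[k - 1]
    raise ValueError("too few class-0 scores for this alpha and delta")


def classify(score, threshold):
    """Predict class 1 (irrelevant) only when the score exceeds the threshold."""
    return 1 if score > threshold else 0


if __name__ == "__main__":
    # Hypothetical held-out scores for 100 class-0 (relevant) posts.
    held_out = [i / 100 for i in range(100)]
    t = np_threshold(held_out, alpha=0.05, delta=0.05)
    print("threshold:", t)
    print("label for score 0.5:", classify(0.5, t))
```

The threshold depends only on class-0 scores, so under these assumptions it is unaffected by a change in the mix of relevant and irrelevant posts, which matches the description's point that the class-conditional distributions are what matter. The rule needs enough held-out class-0 scores: with `alpha = 0.05` and `delta = 0.05`, fewer than 59 scores cannot satisfy the bound and the function raises `ValueError`.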

Provider: Taylor & Francis

Created: 2022-12-08
