Positive and Unlabeled Data: Model, Estimation, Inference, and Classification

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://figshare.com/articles/dataset/Positive_and_Unlabeled_Data_Model_Estimation_Inference_and_Classification/28590277

Description:

This study introduces a new approach to positive and unlabeled (PU) data based on the double exponential tilting model (DETM) within a transfer learning framework. Traditional methods often fall short because they apply only to common distribution (CD) PU data (also known as selected completely at random PU data), where the labeled and unlabeled positive data are assumed to come from the same distribution. In contrast, DETM's dual structure accommodates the more complex and underexplored different distribution (DD) PU data (also known as selected at random PU data), where the labeled and unlabeled positive data may come from different distributions. We rigorously establish the theoretical foundations of DETM, including identifiability, parameter estimation, and asymptotic properties. We also address statistical inference by developing a goodness-of-fit test for the CD assumption and constructing confidence intervals for the proportion of positive instances in the target domain. For classification, we use an approximated Bayes classifier and demonstrate DETM's robust predictive performance. Through theoretical insights and practical applications, this study highlights DETM as a comprehensive framework for addressing the challenges of PU data. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
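
The CD/DD distinction in the description can be illustrated with a small simulation. The following is a minimal sketch and not the DETM estimator: it assumes a hypothetical one-dimensional feature with positives drawn from N(1, 1), and it models labeling through a propensity function, which is the usual way the "selected completely at random" and "selected at random" settings are described. The names `simulate_pu`, `scar`, `sar`, and `mean_gap`, and all numeric constants, are illustrative assumptions that do not come from the dataset or the article.

```python
import math
import random
import statistics


def simulate_pu(n, propensity, seed=0):
    """Draw n positive instances x ~ N(1, 1) (an assumed toy distribution) and
    label each one with probability propensity(x).

    Returns (labeled_positives, unlabeled_positives)."""
    rng = random.Random(seed)
    labeled, unlabeled = [], []
    for _ in range(n):
        x = rng.gauss(1.0, 1.0)
        if rng.random() < propensity(x):
            labeled.append(x)
        else:
            unlabeled.append(x)
    return labeled, unlabeled


def scar(x):
    """CD setting: the labeling probability is constant, independent of x."""
    return 0.3


def sar(x):
    """DD setting: the labeling probability depends on x (logistic, assumed)."""
    return 1.0 / (1.0 + math.exp(-2.0 * x))


def mean_gap(labeled, unlabeled):
    """Difference in sample means between labeled and unlabeled positives."""
    return statistics.fmean(labeled) - statistics.fmean(unlabeled)


cd_gap = mean_gap(*simulate_pu(20000, scar))
dd_gap = mean_gap(*simulate_pu(20000, sar))
print(f"CD (constant propensity) mean gap:    {cd_gap:+.3f}")
print(f"DD (x-dependent propensity) mean gap: {dd_gap:+.3f}")
```

With a constant propensity the two groups of positives share a distribution, so the mean gap stays close to zero. With an x-dependent propensity the labeled positives are shifted relative to the unlabeled ones, which is the situation the description says CD-only methods do not cover.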

Created:

2025-03-13
