Data from: Predicting classifier performance with limited training data: applications to computer-aided diagnosis in breast and prostate cancer
Source: DataCite Commons | Updated: 2025-06-01 | Indexed: 2025-04-09
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.m5n98
Description:
Clinical trials increasingly employ medical imaging data in conjunction
with supervised classifiers, where the latter require large amounts of
training data to accurately model the system. Yet, a classifier selected
at the start of the trial based on smaller and more accessible datasets
may yield inaccurate and unstable classification performance. In this
paper, we aim to address two common concerns in classifier selection for
clinical trials: (1) predicting expected classifier performance for large
datasets based on error rates calculated from smaller datasets and (2) the
selection of appropriate classifiers based on expected performance for
larger datasets. We present a framework for comparative evaluation of
classifiers using only limited amounts of training data by using random
repeated sampling (RRS) in conjunction with a cross-validation sampling
strategy. Extrapolated error rates are subsequently validated via
comparison with leave-one-out cross-validation performed on a larger
dataset. The ability to predict error rates as dataset size increases is
demonstrated on both synthetic data and three different
computational imaging tasks: detecting cancerous image regions in prostate
histopathology, differentiating high and low grade cancer in breast
histopathology, and detecting cancerous metavoxels in prostate magnetic
resonance spectroscopy. For each task, the relationships between three
distinct classifiers (k-nearest neighbor, naive Bayes, and support vector
machine) are explored. Further quantitative evaluation in terms of
interquartile range (IQR) suggests that, across all three datasets, our
approach consistently yields error rates with lower variability (mean IQRs
of 0.0070, 0.0127, and 0.0140) than a traditional RRS approach that does
not employ cross-validation sampling (mean IQRs of 0.0297, 0.0779, and
0.305).
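
As a rough illustration of concern (1), the sketch below estimates error rates at several small training-set sizes by random repeated sampling, summarizes their spread with the IQR, and extrapolates the mean error to a larger size. Several pieces are assumptions made for illustration and are not taken from this description: the synthetic Gaussian data, the nearest-centroid classifier, the fixed held-out test set (this is plain RRS, without the cross-validation sampling the authors add), and the curve form err(n) = a * n^(-alpha) + b. All function names are hypothetical.

```python
import numpy as np

def make_data(rng, n_per_class, dim=5, shift=1.0):
    """Synthetic two-class data: unit Gaussians whose means differ by `shift`."""
    x0 = rng.normal(0.0, 1.0, size=(n_per_class, dim))
    x1 = rng.normal(shift, 1.0, size=(n_per_class, dim))
    return np.vstack([x0, x1]), np.repeat([0, 1], n_per_class)

def nearest_centroid_error(X_tr, y_tr, X_te, y_te):
    """Error rate of a nearest-centroid classifier (a stand-in for kNN, naive Bayes, SVM)."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = np.linalg.norm(X_te - c0, axis=1)
    d1 = np.linalg.norm(X_te - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return float(np.mean(pred != y_te))

def rrs_error_rates(rng, X_pool, y_pool, X_te, y_te, n_train, n_repeats=50):
    """Plain RRS: repeatedly draw a class-balanced training subset and test it."""
    idx0 = np.flatnonzero(y_pool == 0)
    idx1 = np.flatnonzero(y_pool == 1)
    errs = []
    for _ in range(n_repeats):
        pick = np.concatenate([
            rng.choice(idx0, n_train // 2, replace=False),
            rng.choice(idx1, n_train // 2, replace=False),
        ])
        errs.append(nearest_centroid_error(X_pool[pick], y_pool[pick], X_te, y_te))
    return np.array(errs)

def fit_power_law(sizes, mean_errs, alphas=np.linspace(0.05, 2.0, 40)):
    """Fit err(n) = a * n**(-alpha) + b: grid over alpha, least squares for a and b."""
    sizes = np.asarray(sizes, dtype=float)
    best = None
    for alpha in alphas:
        A = np.column_stack([sizes ** -alpha, np.ones_like(sizes)])
        coef, *_ = np.linalg.lstsq(A, mean_errs, rcond=None)
        resid = float(np.sum((A @ coef - mean_errs) ** 2))
        if best is None or resid < best[0]:
            best = (resid, coef[0], alpha, coef[1])
    _, a, alpha, b = best
    return a, alpha, b

rng = np.random.default_rng(0)
X_pool, y_pool = make_data(rng, 200)   # pool from which small training sets are drawn
X_te, y_te = make_data(rng, 500)       # fixed held-out test set (an assumption)

sizes = [10, 20, 40, 80]
results = {n: rrs_error_rates(rng, X_pool, y_pool, X_te, y_te, n) for n in sizes}
mean_errs = np.array([results[n].mean() for n in sizes])
iqrs = {n: float(np.percentile(e, 75) - np.percentile(e, 25)) for n, e in results.items()}

a, alpha, b = fit_power_law(sizes, mean_errs)
predicted_400 = a * 400.0 ** -alpha + b  # extrapolated error at n = 400

print("mean error by size:", dict(zip(sizes, mean_errs.round(4))))
print("IQR by size:", iqrs)
print("extrapolated error at n=400:", predicted_400)
```

Comparing the IQRs produced by two sampling schemes at the same training size is one way to reproduce the kind of variability comparison reported above; swapping in another classifier only requires replacing `nearest_centroid_error`.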
Provider:
Dryad
Created:
2015-01-03



