ADBench Datasets

Name: ADBench Datasets
Creator: Alibaba Cloud Tianchi (阿里云天池)
Published: 2026-05-15 20:23:55
License: Not specified

Alibaba Cloud Tianchi; updated 2026-05-15; indexed 2024-03-07

Download link:

https://tianchi.aliyun.com/dataset/159210

Description:

ADBench (NeurIPS 2022) is the largest anomaly detection (AD) benchmark, comprising 57 datasets. The datasets cover many application domains, including healthcare (e.g., disease diagnosis), audio and language processing (e.g., speech recognition), image processing (e.g., object recognition), and finance (e.g., financial fraud detection), as listed in the final column of the dataset table in the original paper. For datasets with fewer than 1,000 samples, we resample them up to 1,000 samples; for datasets with more than 10,000 samples, we use a subsample of 10,000 because of computational cost.

ADBench also adds several new datasets. Since most public datasets are relatively small and simple, we introduce 10 complex datasets from the computer vision (CV) and natural language processing (NLP) domains, which have more samples and richer features. Running the selected shallow methods (e.g., OCSVM and kNN) directly on large-scale CV and NLP datasets often incurs high time complexity. Therefore, following DeepSAD, ADIB, and DATE, we extract representations from the CV and NLP datasets with neural networks and use them for the downstream anomaly detection task. In particular, ADIB shows that "transferring features from semantic tasks can provide powerful and general representations for various AD problems", and this conclusion holds even when the pre-training task is only loosely related to the downstream AD task. Similarly, DeepSAD uses pre-trained autoencoders to extract features for training traditional AD detectors such as OCSVM and Isolation Forest (IForest). For NLP datasets, DATE uses fastText and GloVe embeddings to compare traditional AD methods (e.g., OCSVM and IForest) against its proposed method.

We wish to elaborate further on the rationale for adapting CV and NLP datasets to tabular AD. First, some shallow models (e.g., OCSVM) cannot run directly on large-scale, high-dimensional CV datasets. Second, we are interested in whether tabular AD methods work on CV/NLP data representations, since deep models are often difficult to run in real-world applications. In addition, the extracted representations usually lead to better downstream detection results. Therefore, we use deep models to extract features from the CV and NLP datasets and create "tabular" versions of them. Although imperfect, this may provide insight into how shallow methods perform on CV and NLP datasets on which they would otherwise be infeasible.

CV datasets: For MNIST-C, we treat the original MNIST images as normal and the corrupted images in MNIST-C as anomalies, consistent with recent work. For MVTec-AD, we test different AD algorithms on its 15 image sets, where anomalies correspond to various manufacturing defects. For CIFAR10, FashionMNIST, and SVHN, we follow prior studies [151, 152]: one of the classes is set as normal, and the remaining classes are downsampled to 5% of the total number of instances to serve as anomalies. We report the average result over all choices of normal class.

NLP datasets: For Amazon and Imdb, we treat the original negative class as the anomaly class. For Yelp, we treat reviews with 0 or 1 stars as the anomaly class and reviews with 3 or 4 stars as the normal class. For 20newsgroups, following DATE and CVDD, we only consider articles from six top-level classes: computer, recreation, science, miscellaneous, politics, and religion. As with the CV datasets, for the multi-class datasets 20newsgroups and Agnews we set one class as normal and downsample the remaining classes to 5% of the total number of instances to serve as anomalies.

Original paper: https://arxiv.org/abs/2206.09426
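The two sampling conventions in the description (capping dataset size to the 1,000 to 10,000 range, and building a one-class-normal dataset with about 5% anomalies) can be sketched as follows. This is an illustrative sketch, not the ADBench code: the function names `normalize_sample_size` and `make_one_vs_rest` are hypothetical, sampling with replacement for small datasets is an assumption (the description only says "resampled"), and "5% of the total instances" is read here as 5% of the resulting dataset.

```python
import numpy as np


def normalize_sample_size(X, y, min_n=1000, max_n=10000, seed=0):
    """Bring a dataset into the [min_n, max_n] size range (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    if n < min_n:
        # Assumption: small datasets are grown by sampling rows with replacement.
        idx = rng.choice(n, size=min_n, replace=True)
    elif n > max_n:
        # Large datasets are cut down to a random subset without replacement.
        idx = rng.choice(n, size=max_n, replace=False)
    else:
        idx = np.arange(n)
    return X[idx], y[idx]


def make_one_vs_rest(X, labels, normal_class, anomaly_ratio=0.05, seed=0):
    """Keep one class as normal (y=0); downsample all other classes to anomalies (y=1)."""
    rng = np.random.default_rng(seed)
    normal_idx = np.flatnonzero(labels == normal_class)
    other_idx = np.flatnonzero(labels != normal_class)
    # Solve n_anom / (n_norm + n_anom) = ratio for n_anom.
    n_anom = int(round(anomaly_ratio * len(normal_idx) / (1.0 - anomaly_ratio)))
    n_anom = min(n_anom, len(other_idx))
    anom_idx = rng.choice(other_idx, size=n_anom, replace=False)
    idx = np.concatenate([normal_idx, anom_idx])
    y = np.concatenate([
        np.zeros(len(normal_idx), dtype=int),
        np.ones(n_anom, dtype=int),
    ])
    return X[idx], y
```

To mirror the "average over all classes" protocol, one would call `make_one_vs_rest` once per class with that class as `normal_class`, evaluate a detector on each resulting dataset, and average the scores.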

Provider:

Alibaba Cloud Tianchi

Created:

2023-07-23

Collection summary

Dataset introduction