patrickfleith/controlled-anomalies-time-series-dataset

Name: patrickfleith/controlled-anomalies-time-series-dataset
Creator: patrickfleith
Published: 2023-09-14 18:30:28
License: CC BY 4.0

Source: Hugging Face. Updated 2023-09-14; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/patrickfleith/controlled-anomalies-time-series-dataset


Resource description:

---
license: cc-by-4.0
task_categories:
- time-series-forecasting
- tabular-classification
tags:
- timeseries
- anomaly
- detection
pretty_name: cats
size_categories:
- 1M<n<10M
---

# Dataset Card for the Controlled Anomalies Time Series (CATS) Dataset

## Dataset Description

Cite the dataset as: Patrick Fleith. (2023). Controlled Anomalies Time Series (CATS) Dataset (Version 2) [Data set]. Solenix Engineering GmbH. https://doi.org/10.5281/zenodo.8338435

### Dataset Summary

The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with **200 injected anomalies.**

The CATS Dataset exhibits a set of desirable properties that make it well suited for benchmarking anomaly detection algorithms in multivariate time series [1]. These properties are listed under Dataset Structure.

### Supported Tasks and Leaderboards

Anomaly Detection in Multivariate Time Series

## Dataset Structure

- **Multivariate (17 variables), including sensor readings and control signals.** It simulates the operational behaviour of an arbitrary complex system, including:
  - **4 Deliberate Actuations / Control Commands sent by a simulated operator / controller**, for instance, commands of an operator to turn some equipment ON/OFF.
  - **3 Environmental Stimuli / External Forces** acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
  - **10 Telemetry Readings** representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
- **5 million timestamps.** Sensor readings are sampled at 1 Hz.
  - **1 million nominal observations** (the first 1 million datapoints). This is suitable for learning the "normal" behaviour.
  - **4 million observations** that include both nominal and anomalous segments. This is suitable for evaluating both semi-supervised approaches (novelty detection) and unsupervised approaches (outlier detection).
- **200 anomalous segments.** One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
- **Different types of anomalies**, to understand which anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.
- **Fine control over ground truth.** As this is a simulated system with deliberate anomaly injection, the start and end times of the anomalous behaviour are known very precisely. In contrast to real-world datasets, where mislabelled segments are common, there is no risk that the ground truth contains mislabelled segments.
- **Suitable for root cause analysis.** In addition to the anomaly category, the time series channel in which the anomaly first developed is recorded and made available as part of the metadata. This can be used to evaluate how well algorithms trace anomalies back to the correct root cause channel.
- **Affected channels.** In addition to the root cause channel in which the anomaly first developed, we provide information on the channels possibly affected by the anomaly. This can also be used to evaluate the explainability of anomaly detection systems that point to the anomalous channels (root cause and affected).
- **Obvious anomalies.** The simulated anomalies have been designed to be "easy" for human eyes to detect (i.e., there are very large spikes or oscillations), and hence should also be detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., eliminating algorithms that are unable to detect those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable for regular benchmark studies as well.
- **Context provided.** Some variables can only be considered anomalous in relation to other behaviours. A typical example is a light and switch pair. The light being either on or off is nominal, and the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose whether or not to use the available context and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.
- **Pure signal, ideal for robustness-to-noise analysis.** The simulated signals are provided without noise. While this may seem unrealistic at first, it is an advantage: users of the dataset can add any type of noise, at an amplitude of their choice, on top of the provided series. This makes the dataset well suited for testing how sensitive and robust detection algorithms are to various levels of noise.
- **No missing data.** You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.

### Data Splits

- The first 1 million points are nominal (no occurrence of anomalies).
- The next 4 million points include both nominal and anomalous segments.

### Licensing Information

license: cc-by-4.0

### Citation Information

Patrick Fleith. (2023). Controlled Anomalies Time Series (CATS) Dataset (Version 1) [Data set]. Solenix Engineering GmbH. https://doi.org/10.5281/zenodo.7646897
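The data splits and segment-level ground truth described in the card can be sketched as follows. This is a minimal illustration on a toy series, not the dataset's own tooling: the card does not specify the file layout or the metadata column names, so `split_nominal_mixed`, `segments_to_labels`, and the `(start, end)` index pairs are hypothetical stand-ins. Only the 1 million point nominal prefix comes from the card.

```python
from typing import List, Sequence, Tuple

N_NOMINAL = 1_000_000  # from the card: the first 1 million points are nominal


def split_nominal_mixed(
    rows: Sequence, n_nominal: int = N_NOMINAL
) -> Tuple[Sequence, Sequence]:
    """Split a time-ordered sequence into the nominal prefix and the mixed remainder."""
    return rows[:n_nominal], rows[n_nominal:]


def segments_to_labels(n_points: int, segments: List[Tuple[int, int]]) -> List[int]:
    """Expand (start, end) index pairs, end inclusive, into per-timestamp 0/1 labels."""
    labels = [0] * n_points
    for start, end in segments:
        if not 0 <= start <= end < n_points:
            raise ValueError(f"segment ({start}, {end}) outside 0..{n_points - 1}")
        for i in range(start, end + 1):
            labels[i] = 1
    return labels


# Toy stand-in for the real series: 10 points, the first 4 treated as nominal.
toy_series = list(range(10))
toy_segments = [(5, 6), (8, 8)]  # hypothetical anomalous segments (absolute indices)

train, evaluation = split_nominal_mixed(toy_series, n_nominal=4)
labels = segments_to_labels(len(toy_series), toy_segments)
```

With the real data, `train` would hold the nominal prefix used to learn "normal" behaviour, and the labels for the remaining points would serve as ground truth when scoring a detector.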
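Because the signals are noise-free and complete, robustness experiments amount to degrading a clean channel in a controlled way. The sketch below shows one possible approach, assuming Gaussian noise and uniformly random dropouts. The card prescribes neither, and the helper names `add_gaussian_noise` and `drop_values` are hypothetical. A sine wave stands in for a real telemetry channel.

```python
import math
import random
from typing import List, Optional


def add_gaussian_noise(signal: List[float], sigma: float, seed: int = 0) -> List[float]:
    """Return a copy of `signal` with zero-mean Gaussian noise of std `sigma` added."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def drop_values(
    signal: List[float], fraction: float, seed: int = 0
) -> List[Optional[float]]:
    """Replace a random `fraction` of values with None to simulate missing data."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_drop = round(fraction * len(signal))
    dropped = set(rng.sample(range(len(signal)), n_drop))
    return [None if i in dropped else x for i, x in enumerate(signal)]


# Toy clean channel standing in for one telemetry reading.
clean = [math.sin(2 * math.pi * t / 50) for t in range(200)]

noisy = add_gaussian_noise(clean, sigma=0.1)  # the noise amplitude is the user's choice
gappy = drop_values(clean, fraction=0.25)     # a quarter of the points removed
```

Running a detector on `clean`, `noisy`, and `gappy` versions of the same channel, and comparing scores against the clean baseline, gives a simple measure of sensitivity to noise level and missing-data rate.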


Provider:

patrickfleith

