Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly Detection
Download link:
https://zenodo.org/record/14720493
This repository contains the replication package for the paper "Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly Detection".
1. Introduction
The workflow of our tool consists of the following five steps:
Column Sampling: Sample a subset of data from the column to generate the patterns.
Coverage Rate (r_{cov}) Estimation: Estimate the percentage of healthy values (r_{cov}) in each column using either the supervised or unsupervised method.
Constrained Template Generation: Generate raw templates for each record with an exact matching rate r_{EM} (r_{EM}=r_{cov} in our study) as a granularity constraint.
Pattern Generation: Generate pattern constraints for each template according to the coverage rate (r_{cov}).
Pattern Selection: Select patterns based on some heuristics (e.g., their generalizability).
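To make the first and third steps more concrete, here is a minimal sketch. It is not RIOLU's implementation: it assumes uniform random sampling for step 1 and a naive character-class generalization for raw templates, and it does not model the r_{EM} constraint. The names sample_column, char_class, and naive_template are hypothetical.

```python
import random
import re
from itertools import groupby

# Character-class tokens used by this illustrative sketch (an assumption,
# not RIOLU's actual token set).
CLASS_TOKENS = {"[0-9]", "[a-z]", "[A-Z]"}


def sample_column(values, n, seed=0):
    """Step 1 (sketch): draw up to n values from a column without replacement."""
    values = list(values)
    if len(values) <= n:
        return values
    return random.Random(seed).sample(values, n)


def char_class(ch):
    """Map one character to a class token, or to an escaped literal."""
    if "0" <= ch <= "9":
        return "[0-9]"
    if "a" <= ch <= "z":
        return "[a-z]"
    if "A" <= ch <= "Z":
        return "[A-Z]"
    return re.escape(ch)


def naive_template(value):
    """Step 3 (sketch): collapse runs of the same class into `class+`."""
    parts = []
    for token, run in groupby(value, key=char_class):
        count = len(list(run))
        if token in CLASS_TOKENS:
            parts.append(token + "+")      # any run length of this class
        else:
            parts.append(token * count)    # literals are kept verbatim
    return "".join(parts)


if __name__ == "__main__":
    for record in sample_column(["abc123", "12345", "AB-1234"], 3):
        print(record, "->", naive_template(record))
```

In this sketch every record yields exactly one template; how RIOLU trades off template granularity against r_{EM} is described in the paper, not here.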
2. Dependencies
Python >= 3.8
Pandas == 1.5.3
Numpy == 1.24.3
Scikit-learn == 1.3.2
Matplotlib == 3.7.5
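As a convenience, the pinned versions above can be compared against the current environment before running the experiments. The sketch below uses only the standard library (importlib.metadata, available from Python 3.8); the helper name check_pins is hypothetical and is not part of the replication package.

```python
from importlib import metadata

# Pinned versions copied from the dependency list above, keyed by their
# PyPI distribution names.
PINS = {
    "pandas": "1.5.3",
    "numpy": "1.24.3",
    "scikit-learn": "1.3.2",
    "matplotlib": "3.7.5",
}


def check_pins(pins):
    """Return {name: (wanted, installed_or_None, matches)} for each pin."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the distribution is not installed
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_pins(PINS).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {installed} [{status}]")
```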
3. Experiments
The tutorial for data profiling, anomaly detection, and RIOLU's plug-and-play function is provided with a running example in tutorial.ipynb.
Data Profiling
Run test_profiling.py to obtain the data profiling results produced by RIOLU's pattern generation module. The patterns for each file, the overall precision, and the overall recall are printed to the output, and a graphical result is saved automatically in the "results" folder.
Anomaly Detection
The two anomaly detection tools read datasets from the "test_anomaly_detection" folder. Set the "dataset" variable in the code to the desired dataset (valid values in our folder: hosp_1k, hosp_10k, hosp_100k, movies). The code automatically reads the ground truth to determine which columns to test, since not all columns contain pattern anomalies.
Unsupervised Version (Auto-RIOLU)
Run Auto-RIOLU.py to get the result of the unsupervised version of RIOLU; the predicted CSV file will be stored in the "results" folder.
Supervised Version (Guided-RIOLU)
Run Guided-RIOLU.py to get the result of the supervised version of RIOLU; the predicted CSV file will be stored in the "results" folder. We use the cleaned version of each dataset as support to estimate the coverage rate (r_{cov}).
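Predictions from either version can be scored against the ground truth with precision and recall. The sketch below assumes the predictions and the ground truth have already been loaded as two aligned lists of booleans (True meaning "anomalous"); the layout of the CSV files is not assumed here, and the function name precision_recall is hypothetical.

```python
def precision_recall(predicted, truth):
    """Compute (precision, recall) for aligned boolean anomaly flags."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    # Report 0.0 when a ratio is undefined (nothing flagged or nothing to find).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    flags = [True, True, False, False]
    labels = [True, False, True, False]
    print(precision_recall(flags, labels))  # (0.5, 0.5)
```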
Plug and Play: RIOLU for others
We also provide a tutorial of RIOLU's plug-and-play function with a running example in tutorial.ipynb.
Pattern Generator
Our pattern generator (pattern_generator.py) can generate a pattern pool with a water flow structure. Run the following code to create patterns and get the actual coverage of each pattern (i.e., the frequency of matching of each pattern on the whole dataset).
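from pattern_generator import PatternGenerator

generator = PatternGenerator(sampled_data, coverage_rate)
generator.pattern_coverage_statictics()  # compute the coverage of each generated pattern
pattern_statistics = generator.pattern_coverage  # coverage statistics of the patterns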
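To illustrate what "actual coverage" means, the sketch below computes, for each regex, the fraction of values in a column that it matches. It assumes full-string matching via re.fullmatch, which may differ from how RIOLU matches internally; the function name coverage_of is hypothetical.

```python
import re


def coverage_of(patterns, values):
    """Return {pattern: fraction of values fully matched by the pattern}."""
    values = list(values)
    if not values:
        return {pattern: 0.0 for pattern in patterns}
    result = {}
    for pattern in patterns:
        compiled = re.compile(pattern)
        matched = sum(1 for v in values if compiled.fullmatch(v))
        result[pattern] = matched / len(values)
    return result


if __name__ == "__main__":
    column = ["123", "45", "12a3", "abc"]
    print(coverage_of(["[0-9]+", "[0-9]+[a-z][0-9]+"], column))
```

A dictionary of this shape (pattern to frequency) is the kind of input the pattern selector expects.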
Pattern Selector
Our pattern selector (pattern_selector.py) selects patterns dynamically using 2-class K-Means clustering. To select patterns, pass a dictionary that maps each pattern to its frequency (e.g., {"[0-9]+": 0.80980980980981, "[0-9]+[a-z][0-9]+$": 0.135, ...}). The output is a list of the selected patterns.
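from pattern_selector import PatternSelector

# pattern_coverage: dictionary mapping each pattern to its frequency
selector = PatternSelector(pattern_coverage, size)
selector.select_patterns()
pattern_pool = selector.pattern_pool  # list of selected patterns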
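The idea behind a 2-class K-Means selection can be sketched in a few lines. This is a simplified one-dimensional version written in pure Python, not the code in pattern_selector.py: it clusters patterns by frequency alone, initializes the two centers at the minimum and maximum frequency, and assumes that the higher-frequency cluster is the one to keep. The function name two_means_select is hypothetical.

```python
def two_means_select(pattern_coverage, max_iter=100):
    """Cluster frequencies into two groups; return patterns in the higher group."""
    items = list(pattern_coverage.items())
    if not items:
        return []
    coverages = [c for _, c in items]
    low_center, high_center = min(coverages), max(coverages)
    if low_center == high_center:
        # All frequencies are equal, so there is nothing to separate.
        return [p for p, _ in items]
    high = items
    for _ in range(max_iter):
        # Assignment step: each pattern joins the nearer center (ties go high).
        high = [(p, c) for p, c in items
                if abs(c - high_center) <= abs(c - low_center)]
        low = [(p, c) for p, c in items
               if abs(c - high_center) > abs(c - low_center)]
        # Update step: move each center to the mean of its cluster.
        new_low = sum(c for _, c in low) / len(low)
        new_high = sum(c for _, c in high) / len(high)
        if (new_low, new_high) == (low_center, high_center):
            break  # converged
        low_center, high_center = new_low, new_high
    return [p for p, _ in high]


if __name__ == "__main__":
    frequencies = {"[0-9]+": 0.81, "[0-9]+[a-z][0-9]+": 0.135, "[a-z]+": 0.02}
    print(two_means_select(frequencies))  # ['[0-9]+']
```

Because the split point is derived from the data, no fixed frequency threshold has to be chosen in advance, which is what makes the selection dynamic.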
ChatGPT Comparison
To validate the effectiveness of RIOLU, we also prompted ChatGPT to generate regexes and flag anomalies as a comparison. We used the GPT-3.5 Turbo API with all default settings. The ChatGPT experiments were run in a single trial because extracting regexes from each response requires manual work and careful validation, which is time-consuming.
Prompt template for pattern-based data profiling:
Hi ChatGPT, given the following data (in a column), please provide me with regular expressions for data profiling that represent the column:
*** Your Column Goes Here ***
The responses and regexes for pattern-based data profiling can be found in the "test_data_profiling" folder.
Prompt template for pattern violation detection:
Hi ChatGPT, given the following data (in a column), please accomplish the following two tasks:
1. Provide me with regular expressions that represent the column.
2. Provide me with anomalous cases in the column.
*** Your Column Goes Here ***
The regexes for pattern violation detection can be found in the "test_anomaly_detection" folder.
4. "Discover-Data-Quality-With-RIOLU-A-Replication-Package" Folder Structure
├── ablation_study
│   ├── 20_subsampling.py
│   ├── no_selection.py
│   ├── static_rEM_1.py
│   ├── static_rcov_95.py
│   ├── static_selection_threshold.py
│   └── readme.md
├── ground_truth_anomaly_detection (Data ground truths)
├── images
├── java_repo_exploration
│   ├── java_names
│   ├── java_naming_anomalies
│   └── readme.md
├── sensitivity_analysis
│   ├── Auto_RIOLU_alt_inircov.py
│   ├── Auto_RIOLU_alt_nsubset.py
│   └── readme.md
├── test_anomaly_detection
│   ├── chatgpt_sampled (Data sampled for ChatGPT & the extracted regexes)
│   ├── flights
│   ├── hosp_1k
│   ├── hosp_10k
│   ├── hosp_100k
│   ├── movies
│   └── readme.md
├── test_data_profiling
│   ├── hetero
│   ├── homo.simple
│   ├── homo
│   ├── GPT_responses.csv (ChatGPT profiling responses & the extracted regexes)
│   └── readme.md
├── Auto-RIOLU.py (Auto-RIOLU for anomaly detection)
├── Guided-RIOLU.py (Guided-RIOLU for anomaly detection)
├── pattern_generator.py
├── pattern_selector.py
├── pattern_summarizer.py
├── test_profiling.py (RIOLU for data profiling)
├── tutorial.ipynb
├── utils.py
├── LICENSE
└── readme.md
Created:
2025-02-14



