Detecting Semantic Data Smells with BERT: A Transformer-Based Approach to Data Quality

Name: Detecting Semantic Data Smells with BERT: A Transformer-Based Approach to Data Quality
Creator: figshare
Published: 2025-06-16 13:36:49
License: No description available

DataCite Commons. Updated: 2025-06-16. Indexed: 2025-09-08

Download link:

https://figshare.com/articles/dataset/Detecting_Semantic_Data_Smells_with_BERT_A_Transformer-Based_Approach_to_Data_Quality/29328182/1

Resource description:

Replication package for the paper "Detecting Semantic Data Smells with BERT: A Transformer-Based Approach to Data Quality". In recent years, the integrity of data used in machine learning pipelines has become increasingly critical, because even state-of-the-art models are constrained by the quality of their input. Among the threats to data reliability, data smells (subtle, often semantic anomalies) are especially hard to detect because they are nuanced and context-sensitive. This paper presents an approach to detecting semantic data smells using BERT, a transformer-based language model originally designed for natural language understanding. We focus on two underexplored categories: Split Value Smells, which reflect improper data fragmentation, and Multiple Value Smells, which reflect improper conflation. To support model training, we construct a large, heterogeneous corpus of synthetic and real datasets and apply data augmentation with Faker and SDV. Two distinct binary classifiers are trained by fine-tuning BERT, achieving F1 scores of 0.88 and 0.98 in detecting these smells. Our findings indicate that transformer-based models are effective at capturing structural patterns in tabular data and can generalize across diverse semantic anomalies. This work lays a foundation for the broader application of language models to data quality assurance and opens new avenues for automating semantic-level data cleaning.
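To make the two smell categories concrete, the following is a minimal sketch of how labeled training text for a binary smell classifier could be produced from clean records. It is not taken from the replication package: the function names, the `key: value | ...` serialization, the `"; "` separator, and the field names `name`, `email`, and `phone` are all assumptions made for illustration, and it uses only the Python standard library rather than Faker or SDV.

```python
import random


def serialize(record):
    """Flatten a row into one text string (assumed input format for a text classifier)."""
    return " | ".join(f"{key}: {value}" for key, value in record.items())


def inject_multiple_value(record, fields, sep="; "):
    """Conflate several fields into the first one (assumed form of a Multiple Value smell)."""
    smelly = {k: v for k, v in record.items() if k not in fields}
    smelly[fields[0]] = sep.join(str(record[f]) for f in fields)
    return smelly


def inject_split_value(record, field, rng):
    """Fragment one value across two columns (assumed form of a Split Value smell)."""
    value = str(record[field])
    cut = rng.randint(1, max(1, len(value) - 1))  # random cut point inside the value
    smelly = {k: v for k, v in record.items() if k != field}
    smelly[f"{field}_1"] = value[:cut]
    smelly[f"{field}_2"] = value[cut:]
    return smelly


def build_examples(records, smell, seed=0):
    """Return (text, label) pairs: label 0 for the clean row, 1 for its smelly variant."""
    rng = random.Random(seed)
    examples = []
    for record in records:
        if smell == "multiple":
            smelly = inject_multiple_value(record, ["email", "phone"])
        elif smell == "split":
            smelly = inject_split_value(record, "name", rng)
        else:
            raise ValueError(f"unknown smell: {smell}")
        examples.append((serialize(record), 0))
        examples.append((serialize(smelly), 1))
    return examples


if __name__ == "__main__":
    # Hypothetical clean record; a real corpus would come from actual or synthetic tables.
    rows = [{"name": "Ada Lovelace", "email": "ada@example.org", "phone": "555-0100"}]
    for text, label in build_examples(rows, "multiple"):
        print(label, text)
```

Building one example set per smell mirrors the description of two separate binary classifiers; the resulting `(text, label)` pairs could then be passed to any sequence-classification fine-tuning setup.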

Provider:

figshare

Created:

2025-06-16
