Towards Reliable Data Augmentation in Machine Learning
Source: NIAID Data Ecosystem | Indexed: 2026-05-10
Download link:
https://data.mendeley.com/datasets/f9k79z29rp
Description:
This repository contains the data and code for the paper "Towards Reliable Data Augmentation in Machine Learning: Practices to Prevent Data Leakage." It includes two fully reproducible case studies (Tomato Disease Detection and Rice Pest Detection) showing that applying data augmentation before the train-test split causes severe data leakage and massively inflated performance metrics. These resources enable the complete replication of our findings.
Highlights of the paper:
- Data leakage from pre-split augmentation inflates mAP50 by up to 66 percentage points.
- Two forms of data leakage are identified: process-induced and data-inherent.
- Leakage is a pervasive systemic problem across applied ML domains.
- Actionable best practices are provided for researchers, reviewers, and educators.
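
The ordering issue described above can be illustrated with a minimal sketch. It is not the repository's code: the function names, the string-based `augment` stand-in, and the split fractions are all assumptions made for illustration. It contrasts splitting before augmenting with augmenting before splitting, and reports which source images end up on both sides of the split.

```python
import random


def augment(sample_id, n_copies=2):
    # Hypothetical stand-in for real image augmentation (flip, rotate, ...).
    # Each variant is a (source_id, variant_id) pair so its origin stays traceable.
    return [(sample_id, f"{sample_id}_aug{k}") for k in range(n_copies)]


def split_then_augment(sample_ids, test_fraction=0.2, seed=0):
    # Split the ORIGINAL samples first, then augment the training portion only.
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    test_ids, train_ids = ids[:n_test], ids[n_test:]
    train = [(i, i) for i in train_ids]
    for i in train_ids:
        train.extend(augment(i))
    test = [(i, i) for i in test_ids]
    return train, test


def augment_then_split(sample_ids, test_fraction=0.2, seed=0):
    # Leaky ordering: augment everything, then split the pooled variants.
    rng = random.Random(seed)
    pool = []
    for i in sample_ids:
        pool.append((i, i))
        pool.extend(augment(i))
    rng.shuffle(pool)
    n_test = max(1, int(len(pool) * test_fraction))
    return pool[n_test:], pool[:n_test]


def leaked_sources(train, test):
    # Source images that contribute at least one variant to BOTH sets.
    return {src for src, _ in train} & {src for src, _ in test}
```

With the first ordering, `leaked_sources` returns an empty set by construction, since no source image contributes to both sets. With the second, variants of the same source image can land on both sides of the split, which is the process-induced leakage the highlights refer to.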
Created:
2025-09-15



