Towards Reliable Data Augmentation in Machine Learning

Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/f9k79z29rp
Description:
This repository contains the data and code for the paper "Towards Reliable Data Augmentation in Machine Learning: Practices to Prevent Data Leakage." It includes two case studies demonstrating full reproducibility (Tomato Disease Detection and Rice Pest Detection), which show that applying data augmentation before the train-test split leads to severe data leakage and massively inflated performance metrics. These resources enable the complete replication of our findings.

Highlights of the paper:
- Data leakage from pre-split augmentation inflates mAP50 by up to 66 percentage points.
- Two forms of leakage are identified: process-induced and data-inherent data leakage.
- Leakage is a pervasive systemic problem across applied ML domains.
- Actionable best practices are provided for researchers, reviewers, and educators.
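
The ordering issue described above can be illustrated with a minimal sketch. It is not the repository's code: the sample IDs, the `augment` stand-in, and the helpers `split`, `source_of`, and `leaked_sources` are all hypothetical, and real augmentation would transform images rather than rename IDs. The sketch only shows why augmenting before the split lets variants of the same original image land on both sides, while splitting first and augmenting only the training portion does not.

```python
import random

def augment(sample_id, n_copies=2):
    """Hypothetical stand-in for image augmentation: returns derived sample IDs."""
    return [f"{sample_id}_aug{i}" for i in range(n_copies)]

def source_of(sample_id):
    """Recover the original image ID from a (possibly augmented) sample ID."""
    return sample_id.split("_aug")[0]

def split(items, test_fraction, seed):
    """Shuffle and split into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def leaked_sources(train, test):
    """Original images that have at least one variant in both train and test."""
    return {source_of(s) for s in train} & {source_of(s) for s in test}

originals = [f"img{i:03d}" for i in range(100)]

# Leaky order: augment the whole pool first, then split.
pool = [s for o in originals for s in [o] + augment(o)]
leaky_train, leaky_test = split(pool, test_fraction=0.2, seed=0)

# Safe order: split the originals first, then augment the training part only.
train_originals, safe_test = split(originals, test_fraction=0.2, seed=0)
safe_train = [s for o in train_originals for s in [o] + augment(o)]

print("sources shared by train and test (augment, then split):",
      len(leaked_sources(leaky_train, leaky_test)))
print("sources shared by train and test (split, then augment):",
      len(leaked_sources(safe_train, safe_test)))
```

In the safe order the test set contains only untouched originals, so any overlap reported by `leaked_sources` would have to come from the data itself (for example, near-duplicate images), not from the pipeline.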
Created:
2025-09-15