
anirudh1112/corrected-tobacco-dataset-with-ocr

Source: Hugging Face. Updated 2026-04-23. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/anirudh1112/corrected-tobacco-dataset-with-ocr
Description:
This repository contains a cleaned and corrected version of the Tobacco3482 dataset, a staple benchmark for document image classification that is known to contain noisy labels. It fixes significant labeling errors in the original source that hurt model performance, integrating corrections provided by the research community to set a higher standard for multi-modal evaluation. The creation and cleaning process comprises alignment, filtering, intersection verification, and OCR processing. The data is organized into standard train, test, and val splits and is compatible with the Hugging Face datasets library.
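
Since the description says the splits are compatible with the Hugging Face datasets library, loading them might look like the minimal sketch below. The repository id is taken from the title of this listing and the split names from the description. The helper names `load_corrected_tobacco` and `summarize_splits` are hypothetical and only serve the illustration. The column layout (image, label, OCR text) is not documented here, so the sketch does not assume any field names.

```python
from collections.abc import Mapping, Sized

# Repository id as shown in this listing.
REPO_ID = "anirudh1112/corrected-tobacco-dataset-with-ocr"
# Split names as stated in the description (assumed to match the Hub exactly).
EXPECTED_SPLITS = ("train", "test", "val")


def load_corrected_tobacco(repo_id: str = REPO_ID):
    """Download all splits from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the helper below works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id)


def summarize_splits(splits: Mapping[str, Sized]) -> dict[str, int]:
    """Return the row count per expected split; raise if a split is missing."""
    missing = [name for name in EXPECTED_SPLITS if name not in splits]
    if missing:
        raise KeyError(f"missing splits: {missing}")
    return {name: len(splits[name]) for name in EXPECTED_SPLITS}


# Usage (requires network access and `pip install datasets`):
#   ds = load_corrected_tobacco()
#   print(summarize_splits(ds))
#   print(ds["train"].features)  # inspect the actual column names
```

Printing `features` before training is a cheap way to confirm the real column names and label set, which this listing does not spell out.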
Provided by:
anirudh1112