Tex-TAR/MMTAD
Source: Hugging Face. Updated: 2025-07-25. Indexed: 2025-11-01.
Download link:
https://hf-mirror.com/datasets/Tex-TAR/MMTAD
Description:
MMTAD (Multilingual Multi-domain Textual Attribute Dataset) consists of 1,623 real-world document images, ranging from legislative records and notices to textbooks and notary documents, captured under diverse lighting, layout, and noise conditions. It provides 1,117,716 word-level annotations for two attribute groups: T1 (Bold, Italic, Bold & Italic) and T2 (Underline, Strikeout, Underline & Strikeout). The dataset covers English, Spanish, and six South Asian languages, with an average of 300–500 annotated words per image. To address class imbalance, context-aware augmentations are applied: shear transforms that generate additional italic samples, and underline and strikeout overlays with realistic noise.
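The shear-based italic augmentation mentioned in the description can be illustrated with a minimal sketch. This is not the dataset authors' implementation: the function name `shear_rows`, the `slant` parameter, and the toy glyph grid are all hypothetical, and a real pipeline would more likely apply an affine transform to word crops with an image library. The sketch only shows the core idea, which is to shift each pixel row horizontally in proportion to its height so that an upright stroke leans to the right.

```python
def shear_rows(glyph, slant=0.5, background=0):
    """Horizontally shear a 2D pixel grid so that upper rows shift right.

    Hypothetical helper for illustration only; `glyph` is a list of
    equal-length rows, with row 0 at the top of the word crop.
    """
    height = len(glyph)
    width = len(glyph[0]) if height else 0
    # The top row moves the farthest, so widen the canvas by that shift.
    max_shift = int(slant * (height - 1)) if height else 0
    sheared = []
    for y, row in enumerate(glyph):
        # Rows nearer the top (small y) get a larger rightward offset.
        shift = int(slant * (height - 1 - y))
        new_row = [background] * (width + max_shift)
        new_row[shift:shift + width] = row
        sheared.append(new_row)
    return sheared


# Toy input: a 4x3 grid holding one upright vertical stroke.
upright = [[0, 1, 0] for _ in range(4)]
italic_like = shear_rows(upright, slant=1.0)
```

With `slant=1.0` the stroke moves one column to the right per row going upward, which is an exaggerated slant chosen to make the effect visible on a tiny grid. The number of foreground pixels is unchanged, so a word-level label of Italic could be attached to the sheared crop without altering its text content.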
Provider:
Tex-TAR



