Tex-TAR/MMTAD

Name: Tex-TAR/MMTAD
Creator: Tex-TAR
Published: 2025-07-25 13:57:34
License: No description available

Hugging Face | Updated 2025-07-25 | Indexed 2025-11-01

Download link:

https://hf-mirror.com/datasets/Tex-TAR/MMTAD


Description:

MMTAD (Multilingual Multi-domain Textual Attribute Dataset) consists of 1,623 real-world document images, ranging from legislative records and notices to textbooks and notary documents, captured under diverse lighting, layout, and noise conditions. It provides 1,117,716 word-level annotations for two attribute groups: T1 (Bold, Italic, Bold & Italic) and T2 (Underline, Strikeout, Underline & Strikeout). The dataset covers English, Spanish, and six South Asian languages, with an average of 300–500 annotated words per image. To address class imbalance, the dataset applies context-aware augmentations, such as shear transforms that generate additional italic samples and underline and strikeout overlays with realistic noise.
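The shear-based italic augmentation mentioned in the description can be illustrated with a toy sketch. This is not the dataset authors' implementation, which is not described on this page; it assumes a word crop represented as a small 2-D grid of pixel values, and the helper name `shear_rows` and its `slant` and `fill` parameters are hypothetical. A real pipeline would more likely apply an affine transform to the image itself.

```python
def shear_rows(crop, slant=0.25, fill=0):
    """Horizontally shear a 2-D grid (list of rows) so upper rows shift right.

    Hypothetical helper for illustration only: `slant` is the horizontal
    shift per row of height, `fill` is the background value used as padding.
    """
    height = len(crop)
    # The top row moves the farthest, so the output is widened by that amount.
    max_shift = round(slant * (height - 1)) if height else 0
    out = []
    for y, row in enumerate(crop):
        # Rows closer to the top (small y) are shifted further to the right.
        shift = round(slant * (height - 1 - y))
        out.append([fill] * shift + list(row) + [fill] * (max_shift - shift))
    return out


# A vertical bar of "ink" (1) becomes a right-leaning stroke after shearing.
bar = [[1, 0, 0]] * 4
for row in shear_rows(bar, slant=0.5):
    print("".join("#" if v else "." for v in row))
```

Shifting whole rows keeps the baseline fixed while the top of the glyph leans right, which is the visual cue that distinguishes italic from upright text. The padding keeps every output row the same width so the result is still a rectangular crop.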

Provider:

Tex-TAR
