
MMTAD

India Data. Updated 2025-09-22. Indexed 2026-05-16.
Download link:
https://india-data.org/dataset-details/615fc27d-a264-44e4-9a6b-73642b2c8e9a
Resource description:
MMTAD (Multilingual Multi-domain Textual Attribute Dataset) comprises 1,623 real-world document images, ranging from legislative records and notices to textbooks and notary documents, captured under diverse lighting, layout, and noise conditions. It delivers 1,117,716 word-level annotations for two attribute groups:

T1: Bold, Italic, Bold & Italic
T2: Underline, Strikeout, Underline & Strikeout

Language & Domain Coverage

English, Spanish, and six South Asian languages
Distribution: 67.4% Hindi, 8.2% Telugu, 8.0% Marathi, 5.9% Punjabi, 5.4% Bengali, 5.2% Gujarati/Tamil/others
300–500 annotated words per image on average

To address class imbalance (e.g., fewer italic or strikeout samples), we apply context-aware augmentations:

Shear transforms to generate additional italics
Realistic, noisy underline and strikeout overlays

These augmentations preserve document context and mimic real-world distortions, ensuring a rich, balanced benchmark for textual attribute recognition.
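The two augmentations named in the description can be illustrated with a minimal sketch. The listing does not publish the dataset's actual augmentation code, so everything here is an assumption for illustration: the function names `shear_italic` and `overlay_line` are hypothetical, a word crop is represented as a binary grid (list of rows, 1 = ink), and the `shear`, `jitter`, and `dropout` parameters are illustrative rather than values used by the dataset authors.

```python
import random


def shear_italic(img, shear=0.25):
    """Slant a binary word crop to the right (hypothetical helper).

    Rows nearer the top are shifted further right, which approximates
    an italic slant. The output is widened so no ink is clipped.
    """
    h, w = len(img), len(img[0])
    pad = int(round(shear * (h - 1)))  # shift applied to the top row
    out = [[0] * (w + pad) for _ in range(h)]
    for y, row in enumerate(img):
        dx = int(round(shear * (h - 1 - y)))  # bottom row stays in place
        for x, v in enumerate(row):
            if v:
                out[y][x + dx] = v
    return out


def overlay_line(img, kind="underline", thickness=1, jitter=1,
                 dropout=0.05, seed=0):
    """Draw a noisy horizontal stroke over a word crop (hypothetical helper).

    kind="underline" places the stroke near the bottom of the crop;
    any other value places it at mid-height, as a strikeout.
    """
    rng = random.Random(seed)
    h, w = len(img), len(img[0])
    base = h - 1 - thickness if kind == "underline" else h // 2
    out = [row[:] for row in img]  # leave the input untouched
    for x in range(w):
        if rng.random() < dropout:
            continue  # skip a column to simulate broken ink
        y0 = base + rng.randint(-jitter, jitter)  # vertical wobble
        for t in range(thickness):
            y = min(max(y0 + t, 0), h - 1)  # clamp to the crop
            out[y][x] = 1
    return out
```

A real pipeline would operate on grayscale or color crops inside the full page image so that surrounding context is preserved, as the description states; the binary grid is used here only to keep the sketch dependency-free.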
Provider:
Natural Language Processing (NLP)
Created:
2025-09-22