
MADCAT Phase 1-3 Composite Evaluation Set

DataCite Commons. Updated 2026-05-05. Indexed 2026-05-20.
Download link:
https://catalog.ldc.upenn.edu/LDC2026T05
Resource description:
<h3>Introduction</h3>
<p>MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Phases 1-3 Composite Evaluation Set <a href="../../../LDC2026T05">(LDC2026T05)</a> contains the evaluation data created by the Linguistic Data Consortium (LDC) to support Phases 1-3 of the DARPA MADCAT Program and the <a href="https://www.nist.gov/itl/iad/mig/openhart">NIST OpenHaRT</a> 2010 and 2013 evaluations. It consists of handwritten Arabic documents scanned at high resolution and annotated with the physical coordinates of each line and token, together with digital transcripts and English translations. Content and annotation layers are integrated in a single MADCAT XML output.</p>
<p>The goal of the MADCAT program was to automatically convert foreign language text images into English transcripts for use by humans and downstream processes, including summarization and information extraction. The core evaluation task in MADCAT was the translation of handwritten Arabic documents.</p>
<h3>Data</h3>
<p>Arabic source documents were collected by LDC in three genres: newswire, weblog and newsgroup text. Arabic-speaking scribes copied documents by hand, following specific instructions as to the writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents were processed to optimize their appearance for the handwriting task, which resulted in some source documents being separated into multiple pages for handwriting. Each resulting handwritten page was assigned to up to three independent scribes using different writing conditions.</p>
<p>The handwritten, transcribed documents were checked for quality and completeness; then each page was scanned at a high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images were annotated to indicate the physical coordinates of each line and token. Explicit reading order was also labeled, along with any errors produced by the scribes when copying the text.</p>
<p>In the final step, a unified data format was produced consisting of the source text, tokenization and sentence segmentation; an image layer of bounding boxes; a scribe demographic layer containing scribe ID and partition (train/test); and a document metadata layer.</p>
<p>This release includes 1,643 scanned images in TIFF format and their corresponding annotation files in both GEDI XML and MADCAT XML formats (.gedi.xml and .madcat.xml). GEDI XML files contain ground truth annotations.</p>
<table border="1" summary="File Counts By Phase">
<tbody>
<tr> <td>Phase</td> <td>File Count</td> </tr>
<tr> <td>1</td> <td>470</td> </tr>
<tr> <td>2</td> <td>540</td> </tr>
<tr> <td>3</td> <td>633</td> </tr>
<tr> <td>Total</td> <td>1,643</td> </tr>
</tbody>
</table>
<h3>Sponsorship</h3>
<p>This work was supported in part by the Defense Advanced Research Projects Agency, MADCAT Program No. HR0011-08-1-004 and GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.</p>
<h3>Updates</h3>
<p>No updates at this time.</p>
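As a rough illustration of how a user might sanity-check a local copy of the release, the sketch below groups files by a shared name stem and reports documents that lack an image or one of the two annotation files. The description above names only the annotation suffixes and the TIFF format; the `.tif`/`.tiff` extensions, the shared-stem naming convention, and the helper names `group_by_document` and `incomplete_documents` are assumptions made for this example and are not taken from the LDC documentation.

```python
from collections import defaultdict
from pathlib import Path

# Annotation suffixes named in the release description.
ANNOTATION_SUFFIXES = (".gedi.xml", ".madcat.xml")
# Assumption: TIFF images use one of these extensions.
IMAGE_SUFFIXES = (".tif", ".tiff")

# Per-phase file counts as stated in the table above.
PHASE_COUNTS = {1: 470, 2: 540, 3: 633}


def group_by_document(paths):
    """Group file paths by stem, assuming image and annotations share one."""
    groups = defaultdict(dict)
    for path in map(Path, paths):
        lower = path.name.lower()
        for suffix in ANNOTATION_SUFFIXES + IMAGE_SUFFIXES:
            if lower.endswith(suffix):
                stem = path.name[: -len(suffix)]
                groups[stem][suffix] = path
                break  # each file matches at most one known suffix
    return dict(groups)


def incomplete_documents(groups):
    """Return stems missing an image or either annotation file."""
    missing = []
    for stem, files in groups.items():
        has_image = any(s in files for s in IMAGE_SUFFIXES)
        has_annotations = all(s in files for s in ANNOTATION_SUFFIXES)
        if not (has_image and has_annotations):
            missing.append(stem)
    return sorted(missing)


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the helpers.
    sample = ["a.tif", "a.gedi.xml", "a.madcat.xml", "b.tif", "b.madcat.xml"]
    print(incomplete_documents(group_by_document(sample)))
    print(sum(PHASE_COUNTS.values()))
```

On a real copy, the list passed to `group_by_document` could come from `Path(root).rglob("*")`, where `root` is wherever the corpus was unpacked. The number of complete groups can then be compared against the stated total of 1,643.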
Provider:
Linguistic Data Consortium
Created:
2026-05-05