MADCAT Chinese Pilot Training Set
Source: Mendeley Data. Updated 2024-01-31; indexed 2024-06-27.
Download link:
https://catalog.ldc.upenn.edu/LDC2014T13
Resource description:
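## Introduction
The MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Chinese Pilot Training Set contains all training data created by the Linguistic Data Consortium (LDC) to support a Chinese pilot collection in the Defense Advanced Research Projects Agency (DARPA) MADCAT Program.
The data in this release consists of handwritten Chinese documents, scanned at high resolution and annotated for the physical coordinates of each line and token. Digital transcripts and English translations of each document are also provided, with the various content and annotation layers integrated in a single MADCAT XML output.
The goal of the MADCAT program was to automatically convert foreign-language text images into English transcripts.
MADCAT Chinese pilot data was collected from Chinese source documents in three genres: newswire, weblog and newsgroup text.
Chinese-speaking "scribes" copied documents by hand, following specific instructions on writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined).
Prior to assignment, source documents were processed to optimize their appearance for the handwriting task, which resulted in some original source documents being broken into multiple "pages" for handwriting.
Each resulting page was assigned to up to five independent scribes, using different writing conditions.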
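
To make the writing conditions concrete, the sketch below enumerates every combination of the three listed factors. It is illustrative only: the description does not say that every combination was used or how conditions were assigned to scribes, and the constant names are chosen here for the example.

```python
from itertools import product

# Factors and values as listed in the corpus description.
STYLES = ("fast", "normal", "careful")
IMPLEMENTS = ("pen", "pencil")
PAPERS = ("lined", "unlined")

# Stated upper bound on independent scribes per page.
MAX_SCRIBES_PER_PAGE = 5

# Every possible (style, implement, paper) combination.
conditions = list(product(STYLES, IMPLEMENTS, PAPERS))

print(len(conditions))                         # 12
print(len(conditions) > MAX_SCRIBES_PER_PAGE)  # True
```

Since there are 12 possible combinations and at most five scribes per page, any single page can cover only a subset of the writing conditions.
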
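The handwritten documents were then checked for quality and completeness, and each page was scanned at high resolution (600 dpi, greyscale) to create a digital version of the handwritten document.
The scanned images were then annotated to indicate the physical coordinates of each line and token. Explicit reading order was also labeled, along with any errors produced by the scribes when copying the text.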
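
The stated scan resolution fixes the relationship between pixels and physical size on the page. The sketch below converts pixel measurements to millimetres. It assumes the annotated coordinates are expressed in pixels of the 600 dpi scan, which the description does not state explicitly; the function names are hypothetical.

```python
MM_PER_INCH = 25.4
SCAN_DPI = 600  # scan resolution stated in the corpus description


def pixels_to_mm(pixels: float, dpi: int = SCAN_DPI) -> float:
    """Convert a length in scan pixels to millimetres on the page."""
    return pixels / dpi * MM_PER_INCH


def box_size_mm(width_px: float, height_px: float, dpi: int = SCAN_DPI) -> tuple[float, float]:
    """Physical (width, height) in mm of a bounding box given in pixels."""
    return pixels_to_mm(width_px, dpi), pixels_to_mm(height_px, dpi)


print(pixels_to_mm(600))      # 25.4 (one inch)
print(box_size_mm(300, 150))  # (12.7, 6.35)
```

At this resolution a token box 300 pixels wide corresponds to about 12.7 mm of handwriting, which can help when filtering out implausibly small or large boxes.
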
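The final step was to produce a unified data format that takes multiple data streams and generates a single MADCAT XML output file containing all required information.
The resulting madcat.xml file contains four distinct components: a text layer that consists of the source text, tokenization and sentence segmentation; an image layer that consists of bounding boxes; a scribe demographic layer that consists of scribe ID and partition (train/test); and a document metadata layer.
LDC has also released:
- MADCAT Phase 1 Training Set (LDC2012T15)
- MADCAT Phase 2 Training Set (LDC2013T09)
- MADCAT Phase 3 Training Set (LDC2013T15)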
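## Data
This release includes 22,284 annotation files in both GEDI XML and MADCAT XML formats (.gedi.xml and .madcat.xml), along with their corresponding scanned image files in TIFF format.
The annotation results in GEDI XML files include ground truth annotations and source transcripts.
Files are named as follows: galeID_page#_scribeID.{tif|gedi.xml|madcat.xml}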
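
The naming rule above can be split into its parts with a small helper, sketched here. It assumes that the page number and scribe ID contain no underscores (the GALE ID itself may), so the stem is split from the right. `parse_madcat_name`, `MadcatName` and the example filename are hypothetical and not taken from the corpus.

```python
from typing import NamedTuple

# Extensions listed in the naming rule.
KNOWN_SUFFIXES = (".gedi.xml", ".madcat.xml", ".tif")


class MadcatName(NamedTuple):
    gale_id: str
    page: str
    scribe_id: str
    kind: str  # "tif", "gedi.xml" or "madcat.xml"


def parse_madcat_name(filename: str) -> MadcatName:
    """Split galeID_page#_scribeID.<ext> into its components."""
    for suffix in KNOWN_SUFFIXES:
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            break
    else:
        raise ValueError(f"unrecognized extension: {filename!r}")

    # Split from the right so underscores inside the GALE ID are preserved.
    parts = stem.rsplit("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected name layout: {filename!r}")

    gale_id, page, scribe_id = parts
    return MadcatName(gale_id, page, scribe_id, suffix[1:])


# Hypothetical filename, for illustration only.
print(parse_madcat_name("DOC_CMN_0001_2_S01.madcat.xml"))
```

Grouping parsed names by `(gale_id, page, scribe_id)` is one way to pair each TIFF image with its GEDI XML and MADCAT XML annotation files.
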
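## Samples
Please view the following samples: Raw Image, GEDI XML, MADCAT XML
## Sponsorship
This work was supported in part by the Defense Advanced Research Projects Agency, MADCAT Program Grant No. HR0011-08-1-0004 and GALE Program Grant No. HR0011-06-1-0003.
The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
## Updates
None at this time.
## Copyright
Portions © 2007 China Military Online, Chinanews.com, Guangming Daily, People's Daily, © 2007, 2014 Trustees of the University of Pennsylvania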
Created: 2024-01-31
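Background overview
The MADCAT Chinese Pilot Training Set is a training dataset of scanned handwritten Chinese document images and their annotations, supporting handwriting recognition and machine translation applications. It provides high-resolution scanned images, digital transcripts, English translations and multi-layer annotation, all integrated in the MADCAT XML format.
The content above was collected and summarized by 遇见数据集.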



