GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web

Name: GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:25:17
License: No description available

DataCite Commons | Updated 2021-07-01 | Indexed 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2013T05

Resource description:

<h3>Introduction</h3>
<p>GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web was developed by LDC and contains 158,387 tokens of word-aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the <a href="http://projects.ldc.upenn.edu/gale/index.html" rel="nofollow">DARPA GALE</a> (Global Autonomous Language Exploitation) program.</p>
<p>Some approaches to statistical machine translation incorporate linguistic knowledge into word-aligned text to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. The tagging scheme defines a set of word tags and alignment link tags to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.</p>
<p>Other releases available in this series are:</p>
<ul>
<li>GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web (<a href="http://catalog.ldc.upenn.edu/LDC2012T16" rel="nofollow">LDC2012T16</a>)</li>
<li>GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (<a href="http://catalog.ldc.upenn.edu/LDC2012T20" rel="nofollow">LDC2012T20</a>)</li>
<li>GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (<a href="http://catalog.ldc.upenn.edu/LDC2012T24" rel="nofollow">LDC2012T24</a>)</li>
</ul>
<h3>Data</h3>
<p>This release consists of Chinese source web data (newsgroup, weblog) collected by LDC. The distribution by words, character tokens and segments appears below:</p>
<table>
<tr> <td>Language</td> <td>Files</td> <td>Words</td> <td>CharTokens</td> <td>Segments</td> </tr>
<tr> <td>Chinese</td> <td>1,224</td> <td>105,591</td> <td>158,387</td> <td>4,836</td> </tr>
</table>
<p>Note that all token counts are based on the Chinese data only. One token is equivalent to one character, and one word is counted as 1.5 characters.</p>
<p>The Chinese word alignment tasks consisted of the following components:</p>
<ul>
<li>Identifying, aligning, and tagging 8 different types of links</li>
<li>Identifying, attaching, and tagging local-level unmatched words</li>
<li>Identifying and tagging sentence/discourse-level unmatched words</li>
<li>Identifying and tagging all instances of Chinese 的 (DE) except when they were part of a semantic link</li>
</ul>
<h3>Samples</h3>
<ul>
<li><a href="./desc/addenda/LDC2013T05.cmn.raw.jpg" rel="nofollow">Chinese raw source sample</a></li>
<li><a href="./desc/addenda/LDC2013T05.cmn.tkn.jpg" rel="nofollow">Chinese character tokenized sample</a></li>
<li><a href="./desc/addenda/LDC2013T05.eng.raw.txt" rel="nofollow">English raw translation sample</a></li>
<li><a href="./desc/addenda/LDC2013T05.eng.tkn.txt" rel="nofollow">English tokenized sample</a></li>
<li><a href="./desc/addenda/LDC2013T05.wa.txt" rel="nofollow">Character-based word alignment sample</a></li>
</ul>
<h3>Sponsorship</h3>
<p>This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.</p>
<h3>Updates</h3>
<p>None at this time.</p>
<br> Portions © 2013 Trustees of the University of Pennsylvania
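The Data section states that one token is one character and that one word is counted as 1.5 characters. The Python sketch below checks how the table's word count relates to its character-token count under that ratio. It is only an illustration: rounding down is an assumption, and `char_tokenize` is a hypothetical helper that treats every non-whitespace character as a token, which may differ from the tokenization LDC actually applied.

```python
# Ratio stated in the corpus description: one word is counted as 1.5 characters.
# Written as 3 characters per 2 words so the arithmetic stays in integers.
CHARS_PER_TWO_WORDS = 3


def estimate_words(char_tokens: int) -> int:
    """Estimate a word count from a character-token count.

    Rounding down is an assumption; the description does not say how
    fractional words are handled.
    """
    return (char_tokens * 2) // CHARS_PER_TWO_WORDS


def char_tokenize(text: str) -> list[str]:
    """Hypothetical helper: treat each non-whitespace character as one token."""
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    # 158,387 is the character-token count from the table.
    print(estimate_words(158_387))   # 105591, matching the Words column
    print(len(char_tokenize("我的 书")))  # 3 character tokens
```

Under these assumptions the estimate reproduces the 105,591 words reported in the table, which suggests the word figure was derived from the character count, not counted independently.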

Provider:

Linguistic Data Consortium

Created:

2020-11-30
