BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training
Source: DataCite Commons. Updated: 2021-07-01. Indexed: 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2016T19
Resource description:
<h3>Introduction</h3><br>
<p>BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training was developed by the Linguistic Data Consortium (LDC) and consists of 448,094 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.</p><br>
<p>The DARPA <a href="https://www.ldc.upenn.edu/collaborations/current-projects/bolt">BOLT</a> (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.</p><br>
<h3>Data</h3><br>
<p>This release consists of Chinese source discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes. The source data is released as BOLT Chinese Discussion Forums (<a href="../../../LDC2016T05">LDC2016T05</a>).</p><br>
<p>The BOLT word alignment task was built on treebank annotation. Specifically, LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at <a href="http://www.cs.brandeis.edu/~clp/clpg/home.html">Brandeis University</a>. The word-segmented tokens were then used to automatically generate the CTB (Chinese Treebank) based alignment, and were also tokenized for character-based alignment by inserting white space between characters.</p><br>
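<p>The character tokenization step described above can be sketched as follows. This is a minimal illustration, not LDC's actual tooling: the function name <code>char_tokenize</code> and the sample sentence are hypothetical, and the sketch assumes every token is split into single characters, which may not match how the release treats empty categories/traces or non-Chinese strings.</p>

```python
def char_tokenize(ctb_tokens):
    """Split word-segmented tokens into space-separated single characters.

    Assumption: every token is split character by character. How the
    release handles traces or Latin-script strings is not stated in the
    description, so no special cases are applied here.
    """
    chars = []
    for token in ctb_tokens:
        # Iterating over a str yields one character at a time.
        chars.extend(token)
    return " ".join(chars)


# Hypothetical word-segmented line, tokens separated by spaces.
ctb_line = "我们 喜欢 论坛"
print(char_tokenize(ctb_line.split()))
```

<p>Because each word-segmented token maps to one or more character tokens, the character-level count for a text is always at least as large as its CTB-level count.</p>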
<p>The data profile, broken down by character tokens, CTB tokens and segments, appears below:</p><br>
<table border="1" cellpadding="5"><br>
<tbody><br>
<tr><br>
<td>Language</td><br>
<td>Genre</td><br>
<td>Files</td><br>
<td>Words</td><br>
<td>CharTokens</td><br>
<td>CTBTokens</td><br>
<td>Segments</td><br>
</tr><br>
<tr><br>
<td>Chinese</td><br>
<td>forum</td><br>
<td>570</td><br>
<td>448,094</td><br>
<td>672,140</td><br>
<td>442,520</td><br>
<td>20,819</td><br>
</tr><br>
</tbody><br>
</table><br>
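<p>For a rough sense of scale, the totals in the table can be converted into averages. The sketch below only divides the reported counts; the variable names are illustrative and the derived ratios are not figures published with the release.</p>

```python
# Totals copied from the data profile table.
char_tokens = 672_140
ctb_tokens = 442_520
segments = 20_819

# Average number of character tokens per CTB token.
chars_per_ctb_token = char_tokens / ctb_tokens

# Average segment length, measured in each tokenization.
ctb_tokens_per_segment = ctb_tokens / segments
char_tokens_per_segment = char_tokens / segments

print(f"character tokens per CTB token: {chars_per_ctb_token:.2f}")
print(f"CTB tokens per segment:         {ctb_tokens_per_segment:.2f}")
print(f"character tokens per segment:   {char_tokens_per_segment:.2f}")
```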
<h3>Acknowledgement</h3><br>
<p>This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.</p><br>
<h3>Samples</h3><br>
<p>Please view the following samples:</p><br>
<ul><br>
<li><a href="desc/addenda/LDC2016T19.char.cmn.txt">Chinese Character Tokenized</a></li><br>
<li><a href="desc/addenda/LDC2016T19.ctb.cmn.txt">Chinese CTB-Based Tokenized</a></li><br>
<li><a href="desc/addenda/LDC2016T19.eng.txt">English Tokenized</a></li><br>
<li><a href="desc/addenda/LDC2016T19.char.wa.txt">Character-Based Word Alignment</a></li><br>
<li><a href="desc/addenda/LDC2016T19.ctb.wa.txt">CTB-Based Word Alignment</a></li><br>
</ul><br>
<h3>Updates</h3><br>
<p>None at this time.</p><br>
Portions © 2012-2016 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30



