
BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training

Source: DataCite Commons. Updated: 2021-07-01. Indexed: 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2016T19
Resource description:
<h3>Introduction</h3>
<p>BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training was developed by the Linguistic Data Consortium (LDC) and consists of 448,094 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.</p>
<p>The DARPA <a href="https://www.ldc.upenn.edu/collaborations/current-projects/bolt">BOLT</a> (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.</p>
<h3>Data</h3>
<p>This release consists of Chinese source discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes. The source data is released as BOLT Chinese Discussion Forums (<a href="../../../LDC2016T05">LDC2016T05</a>).</p>
<p>The BOLT word alignment task was built on treebank annotation. Specifically, LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at <a href="http://www.cs.brandeis.edu/~clp/clpg/home.html">Brandeis University</a>. The word-segmented tokens were then used to automatically generate ctb (Chinese Treebank) alignment and were also tokenized for character alignment by inserting white spaces to separate characters.</p>
<p>The data profile broken down by character tokens, ctb tokens and segments appears below:</p>
<table border="1" cellpadding="5">
<tbody>
<tr>
<td>Language</td>
<td>Genre</td>
<td>Files</td>
<td>Words</td>
<td>CharTokens</td>
<td>CTBTokens</td>
<td>Segments</td>
</tr>
<tr>
<td>Chinese</td>
<td>forum</td>
<td>570</td>
<td>448,094</td>
<td>672,140</td>
<td>442,520</td>
<td>20,819</td>
</tr>
</tbody>
</table>
<h3>Acknowledgement</h3>
<p>This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.</p>
<h3>Samples</h3>
<p>Please view the following samples:</p>
<ul>
<li><a href="desc/addenda/LDC2016T19.char.cmn.txt">Chinese Character Tokenized</a></li>
<li><a href="desc/addenda/LDC2016T19.ctb.cmn.txt">Chinese CTB-Based Tokenized</a></li>
<li><a href="desc/addenda/LDC2016T19.eng.txt">English Tokenized</a></li>
<li><a href="desc/addenda/LDC2016T19.char.wa.txt">Character-Based Word Alignment</a></li>
<li><a href="desc/addenda/LDC2016T19.ctb.wa.txt">CTB-Based Word Alignment</a></li>
</ul>
<h3>Updates</h3>
<p>None at this time.</p>
Portions © 2012-2016 Trustees of the University of Pennsylvania
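The character tokenization described in the Data section (inserting white space between the characters of word-segmented tokens) can be illustrated with a minimal Python sketch. The function name `char_tokenize` and the example sentence are hypothetical. Keeping ASCII-only tokens whole (Latin words, numbers, trace markers such as `*pro*`) is an assumption of this sketch, not a documented LDC rule, so consult the corpus documentation for the actual conventions.

```python
def char_tokenize(segmented_line: str) -> str:
    """Turn a whitespace-segmented line into a character-tokenized line.

    Assumption (not from the corpus documentation): tokens made only of
    ASCII characters are kept whole; every other token is split into
    single characters.
    """
    pieces = []
    for token in segmented_line.split():
        if token.isascii():
            pieces.append(token)   # keep Latin words, digits, trace markers
        else:
            pieces.extend(token)   # one piece per character
    return " ".join(pieces)


# Hypothetical word-segmented input, not taken from the corpus.
ctb_line = "我们 喜欢 这个 论坛"
print(char_tokenize(ctb_line))
```

A character-tokenized line produced this way has at least as many tokens as its word-segmented counterpart, which matches the direction of the CharTokens and CTBTokens counts in the table.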
Provider:
Linguistic Data Consortium
Created:
2020-11-30