
Multiple-Translation Chinese (MTC) Part 4

DataCite Commons | Updated: 2021-07-01 | Indexed: 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2006T04
Description:
<h3>Introduction</h3>
<p>Multiple-Translation Chinese (MTC) Part 4 was developed by the Linguistic Data Consortium (LDC). It contains 100 Chinese newswire source files and their translations by four human translation teams and 11 machine translation (MT) systems, for a total of 1,500 translation files (100 source files × 15 translation sources), as well as assessments of more than 11,000 segments of the MT output. Of the MT systems, five were commercial off-the-shelf (COTS) systems and six were participants in the TIDES 2003 MT Evaluation. Of the COTS systems, two were free web-based services and three were commercial software. For this corpus, LDC assessed the output of all six TIDES participants' MT systems and one of the COTS systems.</p>
<p>To determine whether automatic evaluation metrics such as BLEU track human judgment, LDC performed human assessments of the output of one COTS system and the six TIDES research systems. The corpus includes the assessment results for these seven systems and the specifications used for conducting the assessments.</p>
<h3>Data</h3>
<p>The table below breaks down the source text files by news agency:</p>
<table style="margin-top: 30px; margin-bottom: 30px;" border="1" width="25%">
<tbody>
<tr>
<td>Source</td>
<td>Stories</td>
<td>Words</td>
</tr>
<tr>
<td>Xinhua News Agency</td>
<td>50</td>
<td>19,650</td>
</tr>
<tr>
<td>Agence France Presse</td>
<td>50</td>
<td>22,450</td>
</tr>
<tr>
<td>Total</td>
<td>100</td>
<td>42,100</td>
</tr>
</tbody>
</table>
<p>The Chinese data contains approximately 21K words, while the English translations total 396K words, with 16K unique words.</p>
<p>The original source files used GB-2312 encoding for the Chinese characters and SGML tags to mark sentence and paragraph boundaries and other information about each story. The character encoding is unaltered. To facilitate translation, nearly all SGML tags were removed or replaced by "plain text" markers. The markers were intended to ensure that the resulting translations would be easily alignable to the source texts, so extra care was taken to keep them intact and properly oriented. Some normalization was performed on all files to conform to this format, including splitting long segments into smaller chunks and adding segment markers.</p>
<p>As a last step, all files were converted from UNIX-style line termination (new-line only) to MS-DOS-style (carriage-return plus line-feed), on the assumption that most (possibly all) translators would use MS-Windows-based editors.</p>
<p><strong>Human Translation:</strong> The human translation teams were required to submit an initial set of five stories for quality evaluation and, after receiving initial feedback, continued with the rest of their assigned stories. Translations of the remaining stories were continuously monitored for quality and adherence to the guidelines.</p>
<p><strong>Machine Translation:</strong> Starting from the original SGML text format, special alterations were made to the files on an as-needed basis so that the various systems would accept and handle them correctly. The systems also differed in the input and retrieval methods required to submit the source data for translation and to save the translated text in alignable form.</p>
<p><strong>Human Assessment:</strong> The goal of this effort was to evaluate the quality of output from TIDES research systems, human translation teams, and COTS systems. Translations were evaluated on the basis of adequacy and fluency. Adequacy refers to the degree to which the translation communicates the information present in the original source-language text. Fluency refers to the degree to which the translation is well-formed according to the grammar of the target language.</p>
<h3>Samples</h3>
<p>For an example of the data provided in this corpus, please review the following samples:</p>
<ul>
<li><a href="desc/addenda/LDC2006T04_src.txt" rel="nofollow">Chinese source (TXT)</a></li>
<li><a href="desc/addenda/LDC2006T04_trans.txt" rel="nofollow">English translation (TXT)</a></li>
</ul>
<h3>Updates</h3>
<p>None at this time.</p>
Portions © 2003 Xinhua News Agency, © 2003 Agence France Presse, © 2005-2006 Trustees of the University of Pennsylvania
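
The Data section states that the source files keep their GB-2312 encoding and that all files were converted to MS-DOS line termination. The following is a minimal Python sketch of how such files might be decoded and how the line-ending conversion could be reproduced. The function names are hypothetical and are not part of the corpus or of any LDC tooling; the sketch works on in-memory bytes because the corpus file layout is not described here.

```python
def read_gb2312_lines(raw: bytes) -> list[str]:
    """Decode GB-2312 bytes and split them into lines.

    Assumption: the input uses CRLF (MS-DOS) or LF (UNIX) terminators.
    """
    text = raw.decode("gb2312")
    # Normalise CRLF to LF so both conventions split the same way.
    lines = text.replace("\r\n", "\n").split("\n")
    # A trailing terminator yields one empty final element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


def to_dos_line_endings(text: str) -> str:
    """Convert LF line endings to CRLF.

    Existing CRLF pairs are collapsed first, so applying the
    function twice gives the same result as applying it once.
    """
    return text.replace("\r\n", "\n").replace("\n", "\r\n")


if __name__ == "__main__":
    # Illustrative bytes only; not taken from the corpus.
    sample = "新华社\r\n北京\r\n".encode("gb2312")
    print(read_gb2312_lines(sample))
```

Normalising terminators before splitting avoids leaving stray carriage returns at the end of each segment, which would otherwise interfere with aligning translations to source segments.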
Provider:
Linguistic Data Consortium
Created:
2020-11-30