Multiple-Translation Chinese (MTC) Part 3

Source: DataCite Commons. Updated 2021-07-01; indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2004T07
Resource description:
<h3>Introduction</h3>
<p>Multiple-Translation Chinese (MTC) Part 3 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T07, ISBN 1-58563-289-9.</p>
<p>To support the development of automatic means for evaluating translation quality, the LDC was sponsored to solicit four sets of human translations for a single set of Mandarin Chinese source materials.</p>
<p>Two similar corpora, <a href="http://catalog.ldc.upenn.edu/LDC2002T01" rel="nofollow">Multiple-Translation Chinese Corpus</a> and <a href="http://catalog.ldc.upenn.edu/LDC2003T17" rel="nofollow">Multiple-Translation Chinese Corpus Part 2</a>, were published in 2002 and 2003, respectively. The 2002 corpus (Part 1), the 2003 corpus (Part 2), and the present corpus all use Chinese news articles from multiple sources and provide human translations of them. However, Part 1 also offers translations produced by various commercial off-the-shelf (COTS) systems. In addition to human and COTS translations, Part 2 offers translations from a TIDES research system and provides human assessments of some of the automatic translations.</p>
<h3>Data</h3>
<p>Two sources of journalistic Mandarin Chinese text were selected to provide the Chinese material:</p>
<ul>
<li>AFP News Service: 50 news stories</li>
<li>Xinhua News Service: 50 news stories</li>
</ul>
<p>This gives 100 stories in total. The data was drawn from the May and June 2002 collection of AFP and Xinhua news.</p>
<p>The story selection from the two newswire collections was controlled by story length: all selected stories contain between about 230 and 564 Chinese characters. The overall count of Chinese characters by source is shown in the following table:</p>
<table>
<tr><th>Source</th><th>Chinese characters</th></tr>
<tr><td>AFP</td><td>22,135</td></tr>
<tr><td>Xinhua</td><td>20,321</td></tr>
<tr><td>Total</td><td>42,456</td></tr>
</table>
<p>The Chinese data contains approximately 21K words. The English translations contain approximately 100K words in total and 12K unique words.</p>
<p>The four best translation teams were chosen from the 11 teams that had participated in the translation of Multiple Translation Chinese Corpus Part 1 (<a href="http://catalog.ldc.upenn.edu/LDC2002T01" rel="nofollow">LDC2002T01</a>) and Part 2 (<a href="http://catalog.ldc.upenn.edu/LDC2003T17" rel="nofollow">LDC2003T17</a>) to take part in the project.</p>
<p>In accordance with the guidelines, each translation team was asked to return the first ten Xinhua stories for quality checking. This was to ensure that each team had understood and was following the guidelines, and that the translation quality was acceptable. Where deviations from the guidelines or quality issues were detected, the LDC sent the translations back to the translation team.</p>
<p>Subsequent translation submissions were continuously monitored for conformance and quality. Once the full set of translations was complete, a final pass of reformatting and validation was carried out to ensure that segments could be aligned and to convert the translated texts into SGML format.</p>
<p>Each translation team was also asked to fill out and return a questionnaire describing their procedures and professional background.</p>
<p>Sample files are available in <a href="./desc/addenda/LDC2004T07.chn" rel="nofollow">Chinese</a> and <a href="./desc/addenda/LDC2004T07.eng" rel="nofollow">English</a>. (Chinese characters can be displayed by selecting a Chinese encoding in your browser.)</p>
<h3>Updates</h3>
<p>There are no updates available at this time.</p>
<br>
Portions © 2002 Xinhua News Agency, © 2002 Agence France-Presse, © 2004 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30