
NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations

Source: DataCite Commons | Updated: 2021-07-01 | Indexed: 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2010T01
Description:
<h3>Introduction</h3>
<p>This file contains documentation for NIST Open Machine Translation 2008 Evaluation (MT08) Selected Reference and System Translations, Linguistic Data Consortium (LDC) catalog number LDC2010T01 and ISBN 1-58563-533-2.</p>
<p><a href="http://www.itl.nist.gov/iad/mig/tests/mt/" rel="nofollow">NIST Open MT</a> is an evaluation series that supports research in, and helps advance the state of the art of, technologies that translate text between human languages. Participants submit machine translation output of source language data to NIST (National Institute of Standards and Technology); the output is then evaluated with automatic and manual measures of quality against high-quality human translations of the same source data. This program supports the growing interest in system combination approaches that generate improved translations from the output of several different machine translation (MT) systems. MT system combination approaches require data sets composed of high-quality human reference translations and a variety of machine translations of the same text. The NIST Open Machine Translation 2008 Evaluation (MT08) Selected Reference and System Translations set addresses this need.</p>
<p>The data in this release consists of the human reference translations and corresponding machine translations for the <a href="http://www.itl.nist.gov/iad/mig/tests/mt/2008/" rel="nofollow">NIST Open MT08</a> test sets, which consist of newswire and web data in the four MT08 language pairs: Arabic-to-English, Chinese-to-English, English-to-Chinese (newswire only) and Urdu-to-English. Two documents per language pair and genre were removed at random from the test sets for release. For the machine translations, only output from one submission (in most cases, the participant's primary submission) per training condition (Constrained and Unconstrained training, where available) per participant is included. See Section 2 of the MT08 Evaluation Plan for a description of the training conditions. The resulting data set has the following characteristics:</p>
<ul>
<li>Arabic-to-English: 120 documents with 1312 segments, output from 17 machine translation systems.</li>
<li>Chinese-to-English: 105 documents with 1312 segments, output from 23 machine translation systems.</li>
<li>English-to-Chinese: 127 documents with 1830 segments, output from 11 machine translation systems.</li>
<li>Urdu-to-English: 128 documents with 1794 segments, output from 12 machine translation systems.</li>
</ul>
<p>The data is organized and annotated so that subsets for each language pair, data genre, and training condition can be extracted and used separately, depending on the user's needs.</p>
<h3>Samples</h3>
<ul>
<li><a href="./desc/addenda/LDC2010T01_ref_trans.jpg" rel="nofollow">Arabic to English output, reference</a></li>
<li><a href="./desc/addenda/LDC2010T01_sys_trans.jpg" rel="nofollow">Arabic to English output, system</a></li>
</ul>
<br>
Portions © 2007 Agence France Presse, Al-Ahram, Al Hayat, An Nahar, Al Quds - Al Arabi, Asharq Al-Awsat, Assabah, BBC, The Associated Press, China Military Online, Chinanews.com, Daily Jang, Guangming Daily, Los Angeles Times - Washington Post News Service, Inc., New York Times, PakTribune.com, People's Daily Online, Xinhua News Agency, © 2007, 2009, 2010 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30