Semantic Textual Similarity (STS) 2013 Machine Translation

DataCite Commons | Updated: 2021-07-01 | Indexed: 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2013T18
Resource description:
<h3>Introduction</h3>
<p>Semantic Textual Similarity (STS) 2013 Machine Translation was developed as part of the STS 2013 Shared Task, which was held in conjunction with <a href="http://clic2.cimec.unitn.it/starsem2013/" rel="nofollow">*SEM 2013</a>, the second joint conference on lexical and computational semantics organized by the ACL (Association for Computational Linguistics) interest groups <a href="http://www.clres.com/siglex.html" rel="nofollow">SIGLEX</a> and <a href="http://www.sigsem.org/wiki/Main_Page" rel="nofollow">SIGSEM</a>. It consists of one text file containing 750 English sentence pairs translated from Arabic and Chinese newswire and web data sources.</p>
<p>The goal of the Semantic Textual Similarity (STS) task was to create a unified framework for evaluating semantic textual similarity modules and to characterize their impact on natural language processing (NLP) applications. STS measures the degree of semantic equivalence between two texts. The task was proposed as an attempt to create a unified framework that allows for an extrinsic evaluation of multiple semantic components that have historically been evaluated independently, without characterization of their impact on NLP applications. More information is available at the <a href="http://ixa2.si.ehu.es/sts" rel="nofollow">STS 2013 Shared Task homepage</a>.</p>
<h3>Data</h3>
<p>The source data is Arabic and Chinese newswire and web data collected by LDC that was translated and used in the DARPA GALE (Global Autonomous Language Exploitation) program and in several NIST Open Machine Translation evaluations. Of the 750 sentence pairs, 150 pairs are from the GALE Phase 5 collection and 600 pairs are from NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets (<a href="http://catalog.ldc.upenn.edu/LDC2013T07" rel="nofollow">LDC2013T07</a>).</p>
<p>The data was built to identify semantic textual similarity between two short text passages.
Each line of the corpus contains two tab-delimited sentences. The first sentence is a translation and the second sentence is a post-edited translation. Post-editing is the process of improving machine translation output with a minimum of manual labor. The gold standard similarity values and other STS datasets can be obtained from the STS homepage, linked above.</p>
<h3>Samples</h3>
<p>Please view this <a href="./desc/addenda/LDC2013T18.txt" rel="nofollow">text sample</a>.</p>
<h3>Updates</h3>
<p>None at this time.</p>
<br>
Portions © 2007 Agence France Presse, Al-Ahram, Al Hayat, Al-Quds Al-Arabi, Asharq Al-Awsat, An Nahar, Assabah, China Military Online, Chinanews.com, Guangming Daily, Xinhua News Agency, © 2007, 2013 Trustees of the University of Pennsylvania
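The Data section describes the file layout as two tab-delimited sentences per line, a translation followed by its post-edited version. A minimal sketch of reading that layout in Python might look like the following. The function names `read_pairs` and `load_pairs`, the UTF-8 encoding, and the choice to skip blank lines are assumptions made for illustration; none of them is specified by the corpus documentation.

```python
from typing import Iterable, Iterator, List, Tuple


def read_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (translation, post_edited) tuples from tab-delimited lines.

    Assumes exactly two tab-separated fields per non-blank line, as the
    corpus description states. Blank lines are skipped (an assumption).
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 tab-separated fields, "
                f"got {len(fields)}"
            )
        yield fields[0], fields[1]


def load_pairs(path: str, encoding: str = "utf-8") -> List[Tuple[str, str]]:
    """Read every sentence pair from a file at `path` (hypothetical path)."""
    with open(path, encoding=encoding, newline="") as handle:
        return list(read_pairs(handle))
```

Because the description gives a total of 750 pairs, comparing `len(load_pairs(...))` against 750 is a simple sanity check after download. The gold standard similarity values are distributed separately, so this sketch reads only the sentence pairs.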
Provider:
Linguistic Data Consortium
Created:
2020-11-30