PatTR: Patent Translation Resource
Source: DataCite Commons. Updated: 2025-01-28. Indexed: 2025-04-17.
Download link:
https://heidata.uni-heidelberg.de/citation?persistentId=doi:10.11588/DATA/10002
Description:
PatTR is a sentence-parallel corpus extracted from the MAREC patent collection (http://www.ir-facility.org/prototypes/marec). The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections, as well as 5 million German-French sentence pairs from patent titles, abstracts and claims.

The corpus is sorted by language pair and by text section of a patent document, namely title, abstract, claims and description. Parallel data from the title, abstract and claims sections were extracted from documents belonging to the European Patent Office (EPO, http://www.epo.org/) and the World Intellectual Property Organization (WIPO, http://www.wipo.int/portal/en/index.html) corpora in MAREC. Both resources feature multilingual documents that contain, for example, both an English and a German abstract.

Since there are no multilingual descriptions, data from this section were collected by exploiting patent families to align German and French documents from the EPO corpus with English documents from the United States Patent and Trademark Office (USPTO, http://www.uspto.gov/) corpus, following Utiyama, Masao and Isahara, Hitoshi: A Japanese-English patent parallel corpus. MT Summit XI (2007), 475-482.

All sections were sentence-aligned using the Gargantua aligner (http://sourceforge.net/projects/gargantua/). Preprocessing was done automatically. Sentence boundaries were detected using the Europarl processing tools (http://www.statmt.org/europarl/).

For a detailed description of the corpus construction process, please see the publications above.
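
The patent-family step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layout (the `doc_id`, `family_id` and `lang` fields) and the function name `pair_by_family` are assumptions made for illustration only. The idea is simply that an EPO document in German or French and a USPTO document in English become a candidate document pair when they belong to the same patent family; sentence alignment would then run on each pair.

```python
from collections import defaultdict


def pair_by_family(epo_docs, uspto_docs):
    """Pair EPO documents with USPTO documents that share a patent family.

    Each document is a dict with hypothetical keys 'doc_id' and 'family_id'.
    Returns a list of (epo_doc_id, uspto_doc_id) tuples.
    """
    # Index the English USPTO documents by their family identifier.
    us_by_family = defaultdict(list)
    for doc in uspto_docs:
        us_by_family[doc["family_id"]].append(doc)

    # For every EPO document, look up USPTO documents in the same family.
    pairs = []
    for doc in epo_docs:
        for us_doc in us_by_family.get(doc["family_id"], []):
            pairs.append((doc["doc_id"], us_doc["doc_id"]))
    return pairs


# Toy records; identifiers are invented for the example.
epo_docs = [
    {"doc_id": "EP-1", "family_id": "F1", "lang": "de"},
    {"doc_id": "EP-2", "family_id": "F2", "lang": "fr"},
    {"doc_id": "EP-3", "family_id": "F9", "lang": "de"},  # no USPTO family member
]
uspto_docs = [
    {"doc_id": "US-1", "family_id": "F1", "lang": "en"},
    {"doc_id": "US-2", "family_id": "F2", "lang": "en"},
]

pairs = pair_by_family(epo_docs, uspto_docs)
```

In this toy input, EP-3 has no family member in the USPTO set and therefore yields no pair, which mirrors why only part of a collection ends up in the parallel description data.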
Provider:
heiDATA
Created:
2014-06-05



