PaGeS
Source: SSH Open MarketPlace | Updated: 2023-10-13 | Indexed: 2024-08-03
Download link:
https://marketplace.sshopencloud.eu/dataset/jCTHNn
Resource description:
This corpus consists of two major parts: the core corpus and the supplements.
The core corpus consists of original texts in German and Spanish and their respective translations, as well as a small percentage (approx. 6%) of German and Spanish texts translated from a third language. It includes samples from 178 works of fiction (novels and short stories) as well as samples from non-fiction (essays and popular texts).
The texts have been manually verified at different levels, and the automatic alignment of the bisegments, performed by [LF-Aligner](https://sourceforge.net/projects/aligner/), has been manually reviewed. The German texts have been lemmatized and PoS-tagged with [Treetagger](http://hdl.handle.net/11022/1007-0000-0000-8E4D-B) (part of the [PoS taggers and lemmatizers Resource Family](https://www.clarin.eu/resource-families/tools-part-speech-tagging-and-lemmatization)) and the Spanish texts with [Freeling](https://nlp.lsi.upc.edu/freeling/node/1). The tags from both taggers have been mapped to the Universal PoS tags.
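The tag-mapping step can be illustrated with a minimal sketch. It assumes the German tagger output uses STTS tags and that the target labels follow the Universal Dependencies UPOS inventory. The table below is a small illustrative subset, not the mapping actually used for this corpus, and the names `STTS_TO_UPOS`, `to_upos`, and `map_tagged_tokens` are hypothetical.

```python
# Illustrative subset of an STTS -> Universal PoS mapping.
# Assumption: this is NOT the corpus's own mapping table, only an example.
STTS_TO_UPOS = {
    "NN": "NOUN",     # common noun
    "NE": "PROPN",    # proper noun
    "ADJA": "ADJ",    # attributive adjective
    "ADJD": "ADJ",    # predicative/adverbial adjective
    "ART": "DET",     # article
    "APPR": "ADP",    # preposition
    "ADV": "ADV",     # adverb
    "VVFIN": "VERB",  # finite full verb
    "VAFIN": "AUX",   # finite auxiliary verb
    "KON": "CCONJ",   # coordinating conjunction
    "CARD": "NUM",    # cardinal number
    "PPER": "PRON",   # personal pronoun
    "$.": "PUNCT",    # sentence-final punctuation
}


def to_upos(tag: str) -> str:
    """Return the universal tag for an STTS tag, or "X" if it is not in the table."""
    return STTS_TO_UPOS.get(tag, "X")


def map_tagged_tokens(tokens):
    """Convert (token, stts_tag, lemma) triples into (token, lemma, upos) triples."""
    return [(token, lemma, to_upos(tag)) for token, tag, lemma in tokens]


if __name__ == "__main__":
    # Hypothetical tagger output for a short German sentence.
    sentence = [
        ("Der", "ART", "die"),
        ("Hund", "NN", "Hund"),
        ("bellt", "VVFIN", "bellen"),
        (".", "$.", "."),
    ]
    for row in map_tagged_tokens(sentence):
        print(row)
```

A lookup table with an explicit fallback keeps unmapped tags visible (as `X`) rather than silently dropping them. The Spanish tags would need their own table of the same kind.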
So far, the supplements include [Europarl v7](https://www.statmt.org/europarl/), a corpus of the proceedings (verbatim reports) of the European Parliament from 1996 to 2011 (also part of the [Parliamentary Corpora Resource Family](https://www.clarin.eu/resource-families/parliamentary-corpora)), and Ted-Talks (part of this family), a corpus of the German and Spanish translations of the transcriptions of Ted-Talks from 2006 to 2020.
The corpus is available for online browsing via a dedicated interface.
Created:
2023-10-13



