PaGeS

SSH Open Marketplace | Updated 2023-10-13 | Indexed 2024-08-03

Download link:

https://marketplace.sshopencloud.eu/dataset/jCTHNn

Resource description:

This corpus consists of two major parts: the core corpus and the supplements. The core corpus contains original texts in German and Spanish and their respective translations, as well as a small percentage (approx. 6%) of German and Spanish texts translated from a third language. It includes samples from 178 works of fiction (novels and short stories) as well as samples from non-fiction (essays and popular texts).

The texts have been manually verified at different levels, and the automatic alignment of the bisegments, performed by [LF-Aligner](https://sourceforge.net/projects/aligner/), has been manually reviewed. The German texts have been lemmatized and PoS-tagged with [Treetagger](http://hdl.handle.net/11022/1007-0000-0000-8E4D-B) (part of the [PoS taggers and lemmatizers Resource Family](https://www.clarin.eu/resource-families/tools-part-speech-tagging-and-lemmatization)) and the Spanish texts with [Freeling](https://nlp.lsi.upc.edu/freeling/node/1). The tags of both taggers have been mapped to the Universal PoS tags.

The supplements so far include [Europarl v7](https://www.statmt.org/europarl/), a corpus that collects the proceedings (verbatim reports) of the European Parliament from 1996 to 2011 (also part of the [Parliamentary Corpora Resource Family](https://www.clarin.eu/resource-families/parliamentary-corpora)), and Ted-Talks (part of this family), a corpus that collects the German and Spanish translations of the transcriptions of Ted-Talks from 2006 to 2020. The corpus is available for online browsing via a dedicated interface.
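The mapping of tagger-specific tags to Universal PoS tags mentioned above can be pictured with a small sketch. This is not the mapping used by the corpus authors. It assumes that the German tags follow the STTS tagset and the Spanish tags follow the EAGLES-style scheme, covers only a handful of tags, and uses hypothetical names (`STTS_TO_UPOS`, `EAGLES_PREFIX_TO_UPOS`, `to_upos`) chosen for illustration.

```python
# Illustrative subset only: STTS tags (assumed for the German side) -> UPOS.
STTS_TO_UPOS = {
    "NN": "NOUN", "NE": "PROPN", "ADJA": "ADJ", "ADJD": "ADJ",
    "VVFIN": "VERB", "VAFIN": "AUX", "ART": "DET", "APPR": "ADP",
    "ADV": "ADV", "KON": "CCONJ", "CARD": "NUM",
}

# Illustrative subset only: EAGLES-style tags (assumed for the Spanish side)
# encode the category in their first one or two characters.
EAGLES_PREFIX_TO_UPOS = {
    "NC": "NOUN", "NP": "PROPN", "VM": "VERB", "VA": "AUX", "VS": "AUX",
    "CC": "CCONJ", "CS": "SCONJ",
    "A": "ADJ", "R": "ADV", "D": "DET", "P": "PRON",
    "S": "ADP", "Z": "NUM", "F": "PUNCT", "I": "INTJ",
}


def to_upos(tag: str, lang: str) -> str:
    """Map a tagger-specific tag to a UPOS tag; unknown tags become 'X'."""
    if lang == "de":
        return STTS_TO_UPOS.get(tag, "X")
    if lang == "es":
        # Try the two-character prefix first, then the one-character prefix.
        for length in (2, 1):
            upos = EAGLES_PREFIX_TO_UPOS.get(tag[:length])
            if upos is not None:
                return upos
        return "X"
    raise ValueError(f"unsupported language: {lang}")


if __name__ == "__main__":
    print(to_upos("NN", "de"))       # German common noun
    print(to_upos("NCMS000", "es"))  # Spanish common noun
```

Once both languages share one tagset, a query such as "all NOUN tokens" can be run across the German and Spanish sides of an aligned bisegment without knowing either tagger's native tags.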

Created:

2023-10-13
