PaGeS

Source: SSH Open MarketPlace. Updated: 2023-10-13. Indexed: 2024-08-03.
Download link:
https://marketplace.sshopencloud.eu/dataset/jCTHNn
Description:
This corpus consists of two major parts: the core corpus and the supplements. The core corpus contains original texts in German and Spanish and their respective translations, as well as a small percentage (approx. 6%) of German and Spanish texts translated from a third language. It includes samples from 178 works of fiction (novels and short stories) as well as samples from non-fiction (essays and popular texts). The texts have been manually verified at different levels, and the automatic alignment of the bisegments, performed by [LF-Aligner](https://sourceforge.net/projects/aligner/), has been manually reviewed. The German texts have been lemmatized and PoS-tagged with [TreeTagger](http://hdl.handle.net/11022/1007-0000-0000-8E4D-B) (part of the [PoS taggers and lemmatizers Resource Family](https://www.clarin.eu/resource-families/tools-part-speech-tagging-and-lemmatization)) and the Spanish texts with [FreeLing](https://nlp.lsi.upc.edu/freeling/node/1). The tags of both have been mapped to the Universal PoS tags. The supplements so far include [Europarl v7](https://www.statmt.org/europarl/), a corpus that collects the proceedings (verbatim reports) of the European Parliament from 1996 to 2011 (also part of the [Parliamentary Corpora Resource Family](https://www.clarin.eu/resource-families/parliamentary-corpora)), and Ted-Talks (part of this family), a corpus that collects the German and Spanish translations of the transcriptions of Ted-Talks from 2006 to 2020. The corpus is available for online browsing via a dedicated interface.
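
The tag mapping mentioned in the description can be pictured with a small sketch. It assumes the German TreeTagger output uses STTS tags and the Spanish FreeLing output uses EAGLES-style tags; the tables below are a hand-picked illustrative subset, and the function name `to_upos` is hypothetical. The actual mapping tables used for the corpus are not given in this description.

```python
# Illustrative subset only; NOT the mapping tables used by the corpus authors.

# German: a few STTS tags (as emitted by TreeTagger) mapped to Universal PoS tags.
STTS_TO_UPOS = {
    "NN": "NOUN", "NE": "PROPN", "ADJA": "ADJ", "ADJD": "ADJ",
    "ADV": "ADV", "ART": "DET", "APPR": "ADP", "KON": "CCONJ",
    "CARD": "NUM", "VVFIN": "VERB", "VAFIN": "AUX",
}

# Spanish: EAGLES-style tags (as emitted by FreeLing) encode the category in
# their first one or two characters. More specific prefixes are listed first.
# Simplification: every verb tag other than VA* is mapped to VERB.
EAGLES_PREFIX_TO_UPOS = [
    ("NP", "PROPN"), ("NC", "NOUN"), ("VA", "AUX"), ("V", "VERB"),
    ("A", "ADJ"), ("R", "ADV"), ("D", "DET"), ("P", "PRON"),
    ("S", "ADP"), ("CC", "CCONJ"), ("CS", "SCONJ"), ("Z", "NUM"),
    ("I", "INTJ"), ("F", "PUNCT"),
]


def to_upos(tag: str, lang: str) -> str:
    """Map a language-specific tag to a Universal PoS tag ("X" if unknown)."""
    if lang == "de":
        return STTS_TO_UPOS.get(tag, "X")
    if lang == "es":
        for prefix, upos in EAGLES_PREFIX_TO_UPOS:
            if tag.startswith(prefix):
                return upos
        return "X"
    raise ValueError(f"unsupported language: {lang!r}")


if __name__ == "__main__":
    print(to_upos("NN", "de"))       # German common noun
    print(to_upos("NCMS000", "es"))  # Spanish common noun
```

Mapping both tagsets onto one shared inventory is what makes it possible to query German and Spanish bisegments with the same PoS labels.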
Created:
2023-10-13