five

EUROPARL Corpus Parallel Corpora: Portuguese-English

收藏
DataCite Commons2022-06-01 更新2025-04-15 收录
下载链接:
https://live.european-language-grid.eu/catalogue/corpus/1400
下载链接
链接失效反馈
官方服务:
资源简介:
The EUROPARL Corpus (Portuguese-English subpart of the parallel corpora), was extracted from the proceedings of the European Parliament. It contains transcriptions of sessions dating back from 1996 to 2011, with a total of approximately 58,324,562 tokens of European Portuguese (L1) and 49,216,896 tokens of English (translation). <p><p>The EUROPARL Corpus is composed of one text file for the English corpus and two files for the Portuguese version: a text file and an annotated file. The text version contains plain text and no further annotation. The Portuguese annotated file is a four-column file with one token per line, followed by a PoS tag and a lemma. The corpus was automatically PoS-tagged with MBT tagger (http://ilk.uvt.nl/mbt/), and lemmatized with MBLEM (http://ilk.uvt.nl/mbma/), following the annotation scheme of the Corpus of Reference of Contemporary Portuguese.
提供机构:
ELG
创建时间:
2022-06-01
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作