EN-DE-Bidirectional-Europarl-UdS corpus
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/11127625
Description:
Bidirectional document- and sentence-aligned corpus of Europarl proceedings (EN-DE, DE-EN).
The corpus is described in the paper:
Mitigating Translationese with GPT-4: Strategies and Performance
Maria Kunilovskaya, Koel Dutta Chowdhury, Heike Przybyl, Cristina España-Bonet, Josef van Genabith
in: Proceedings of the 25th Annual Conference of the European Association for Machine Translation (EAMT-2024), 24-27 June, Sheffield (UK). Association for Computational Linguistics.
This version of the Europarl-UdS corpus is built from the Europarl proceedings (published up to 10 July 2018), collected in November 2023 using Jose Martinez's pipeline (https://github.com/chozelinek/europarl).
We include several versions and subsets of this parallel corpus:
The initial input (0_align/xml_translationese/) to the parallel corpus-building pipeline:
xml_translationese.zip
Raw-text-based sentence-aligned documents for both translation directions (columns=['sdoc_id', 'sseg_id', 'sseg', 'tseg', 'hunalign_qua']):
deen_wide2018_cap0_score0.3.tsv.gz
ende_wide2018_cap0_score0.5.tsv.gz
meta.zip contains four files with XML tags holding the metadata for each document in the corpus.
The same documents annotated with Stanza (in a CoNLL-U-style vertical format, with each segment and document wrapped in XML tags containing metadata):
ORG_WR_DE_EN.conllu.xml.gz (original German)
ORG_WR_EN_DE.conllu.xml.gz (original English)
TR_DE_EN.conllu.xml.gz (translated English)
TR_EN_DE.conllu.xml.gz (translated German)
The sentence-level subset of the corpus with extracted morphosyntactic (or lexicogrammatical) features as described in the paper (documents longer than 450 tokens, 1500 documents per translation direction):
seg-450-1500.feats.tsv.gz
A multi-parallel subset of the 200 most_translated documents, re-written by GPT-4 under various prompting conditions as described in the paper:
ratio2.5_de_7aligned_2056segs.tsv
ratio2.5_en_7aligned_2109segs.tsv
Note that the annotation retains the alignment. The processing code is available here https://github.com/SFB1102/b7-b6-prompting-eamt2024/1_parse_extract_feats
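As a rough illustration of working with the sentence-aligned files listed above, the following Python sketch reads one of the gzipped TSV files and optionally filters segment pairs by alignment score. It relies only on the column names given in the description; everything else is an assumption: that the file is UTF-8 encoded, that it starts with a header row, that the columns appear in the listed order, and that hunalign_qua holds a numeric score. The function name read_aligned and its parameters are hypothetical and not part of the released code.

```python
import csv
import gzip

# Column names as listed in the corpus description; their order in the
# actual files is an assumption.
COLUMNS = ["sdoc_id", "sseg_id", "sseg", "tseg", "hunalign_qua"]


def read_aligned(path, min_score=None, has_header=True):
    """Yield one dict per aligned segment pair from a gzipped TSV file.

    min_score: if given, keep only rows whose hunalign_qua is >= min_score.
    has_header: set to False if the file has no header row (assumption:
    the released files do have one).
    """
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        if has_header:
            next(reader, None)  # skip the header row
        for fields in reader:
            if len(fields) != len(COLUMNS):
                continue  # skip rows that do not match the expected layout
            row = dict(zip(COLUMNS, fields))
            # Assumption: the alignment quality column is numeric.
            row["hunalign_qua"] = float(row["hunalign_qua"])
            if min_score is None or row["hunalign_qua"] >= min_score:
                yield row


# Example use (requires the file to be downloaded first):
# for row in read_aligned("deen_wide2018_cap0_score0.3.tsv.gz", min_score=0.5):
#     print(row["sdoc_id"], row["sseg"], "->", row["tseg"])
```

Reading the file as a generator avoids loading the whole corpus into memory. If the real files differ from these assumptions (for example, an extra index column), COLUMNS and the row-length check would need adjusting.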
Created:
2024-05-07



