
EN-DE-Bidirectional-Europarl-UdS corpus

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/11127625
Description:
Bidirectional document- and sentence-aligned corpus of Europarl proceedings (EN-DE, DE-EN).

The corpus is described in the paper:
Maria Kunilovskaya, Koel Dutta Chowdhury, Heike Przybyl, Cristina España-Bonet, Josef van Genabith. Mitigating Translationese with GPT-4: Strategies and Performance. In: Proceedings of the 25th Annual Conference of the European Association for Machine Translation (EAMT-2024), 24-27 June, Sheffield (UK). Association for Computational Linguistics.

This version of the Europarl-UdS corpus is built from the Europarl proceedings (published up to 10 July 2018), collected in November 2023 using Jose Martinez's pipeline (https://github.com/chozelinek/europarl).

We include several versions and subsets of this parallel corpus:

The initial input (0_align/xml_translationese/) to the parallel corpus-building pipeline:
    xml_translationese.zip

Raw-text-based sentence-aligned documents for both translation directions (columns=['sdoc_id', 'sseg_id', 'sseg', 'tseg', 'hunalign_qua']):
    deen_wide2018_cap0_score0.3.tsv.gz
    ende_wide2018_cap0_score0.5.tsv.gz

meta.zip contains four files with XML tags holding the metadata for each document in the corpus.

The same documents annotated with Stanza (in the conllu-style vertical format, with each segment and document wrapped in XML tags containing metadata):
    ORG_WR_DE_EN.conllu.xml.gz (original German)
    ORG_WR_EN_DE.conllu.xml.gz (original English)
    TR_DE_EN.conllu.xml.gz (translated English)
    TR_EN_DE.conllu.xml.gz (translated German)

The sentence-level subset of the corpus with extracted morphosyntactic (or lexicogrammatical) features as described in the paper (documents longer than 450 tokens, 1500 documents per translation direction):
    seg-450-1500.feats.tsv.gz

A multi-parallel subset of 200 most_translated documents re-written by GPT-4 under various prompting conditions as described in the paper:
    ratio2.5_de_7aligned_2056segs.tsv
    ratio2.5_en_7aligned_2109segs.tsv

Note that the annotation retains the alignment.
The processing code is available here https://github.com/SFB1102/b7-b6-prompting-eamt2024/1_parse_extract_feats
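
As a rough illustration of how the sentence-aligned files listed above might be read, here is a minimal sketch using only the Python standard library. The five column names come from the description. Everything else is an assumption: the function name read_aligned_segments and its min_score parameter are hypothetical, the files are assumed to be tab-separated UTF-8 with an optional header row, and hunalign_qua is assumed to parse as a number. This is not taken from the authors' processing code.

```python
import csv
import gzip

# Column names as listed in the corpus description.
COLUMNS = ["sdoc_id", "sseg_id", "sseg", "tseg", "hunalign_qua"]


def read_aligned_segments(path, min_score=None):
    """Read a gzipped TSV of aligned segments into a list of dicts.

    Assumptions (not confirmed by the description): tab-separated UTF-8,
    an optional header row equal to COLUMNS, a numeric hunalign_qua field.
    Rows with the wrong number of fields or a non-numeric score are skipped.
    """
    rows = []
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if fields == COLUMNS:
                continue  # header row, if the file has one
            if len(fields) != len(COLUMNS):
                continue  # malformed line
            row = dict(zip(COLUMNS, fields))
            try:
                row["hunalign_qua"] = float(row["hunalign_qua"])
            except ValueError:
                continue  # score is not numeric
            if min_score is not None and row["hunalign_qua"] < min_score:
                continue  # below the requested alignment-quality threshold
            rows.append(row)
    return rows
```

A call such as read_aligned_segments("deen_wide2018_cap0_score0.3.tsv.gz", min_score=0.5) would then return only the segment pairs whose alignment score is at least 0.5, provided the file has been downloaded locally and matches the layout assumed here.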
Created:
2024-05-07