
GeFRePaC - German French Reciprocal Parallel Corpus

Source: catalogue.elra.info | Updated: 2017-06-26 | Indexed: 2025-03-26
Download link:
https://catalogue.elra.info/en-us/repository/browse/ELRA-W0031/
Resource description:
The German-French Reciprocal Parallel Corpus (GeFRePaC) was produced by the Multilinguale Forschung/Multilingual Research Abteilung Lexik, Institut für Deutsche Sprache (Germany), with funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335).

GeFRePaC is a 30 million word corpus (15 million for each language) intended for developing, enhancing and improving translation aids (dictionaries, lexicons, platforms) for French-German and German-French translation. The database consists of the following parallel corpora:

European Union CELEX Database: Treaties, Foreign relations, Law, Complementary Law and all the published documents of the "European Parliament".
Celex-Database: 22,000,000 words (German+French)
Europarl: 8,320,000 words (German+French)

The corpus covers natural general language as used in public socio-political discourse, with a focus on multilingual administration and on commercial and legal documentation. GeFRePaC comprises a large variety of text types for which there is a rapidly growing need for translation but which currently defy successful machine translation.

The corpus is encoded according to the PAROLE guidelines. It was aligned at the sentence level and, for single-word translation units, at the lexical level. It was POS-tagged in conformity with EAGLES recommendations and validated according to the most current version of the ELRA guidelines. The parallel German-French texts were aligned using a program developed at the Equipe Langue et Dialogue, Laboratoire Loria, Nancy. The text files containing markup for paragraphs and sentences were processed by the TreeTagger developed at the IMS Stuttgart. The text files are automatically converted into TEI-conformant SGML format.

Provider:
catalogue.elra.info