rinto/dgt-tm
Source: Hugging Face. Updated: 2024-11-30. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/rinto/dgt-tm
Summary:
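The DGT-Translation Memory dataset contains parallel texts of European Union legislative documents (Acquis Communautaire) in 24 official languages and is mainly used for NLP tasks such as machine translation, text classification, and text generation. The dataset is stored in TMX format and is extracted and preprocessed with Java and Python tools. The data covers EU legislative documents from 2007 to 2012 and is updated every year. Use of the dataset is subject to the European Commission's licence conditions; users must comply with the relevant intellectual property and usage terms.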
The DGT Translation Memory is a multilingual translation memory provided by the Directorate-General for Translation of the European Commission, containing parallel texts of the European Union's legislative documents (Acquis Communautaire). The dataset includes translation units in 24 official EU languages and supports natural language processing tasks such as statistical machine translation, dictionary and ontology building, training and testing of multilingual information extraction software, automatic translation consistency checking, and testing and benchmarking of alignment software. Preprocessing consists of extracting sentence pairs from the TMX files; Python and Rust scripts are provided for this. Statistics are given for the number of translation units, words, and characters in each language. Usage conditions cover intellectual property and software usage terms.
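As an illustration of extracting sentence pairs from a TMX file, here is a minimal Python sketch using only the standard library. It is not the script shipped with the dataset: the function name `extract_pairs`, the inline `SAMPLE` document, and the choice to reduce language codes to their first two letters are assumptions made for this example. It also assumes each translation unit is a `<tu>` element holding one `<tuv>` per language with the text in a `<seg>` child, and that the language code is carried in either `xml:lang` or a plain `lang` attribute.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(tmx_text, src, tgt):
    """Return (source, target) segment pairs for two 2-letter language codes.

    Hypothetical helper: translation units missing either language are skipped.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Accept both xml:lang and a plain lang attribute (assumption).
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary language subtag, e.g. "EN-GB" -> "EN".
                segments[lang.upper()[:2]] = "".join(seg.itertext()).strip()
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


# Tiny made-up TMX document, for illustration only.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu>
  <tuv xml:lang="EN-GB"><seg>Article 1</seg></tuv>
  <tuv xml:lang="FR-FR"><seg>Article premier</seg></tuv>
</tu>
<tu>
  <tuv xml:lang="EN-GB"><seg>Only English</seg></tuv>
</tu>
</body></tmx>"""

if __name__ == "__main__":
    for source, target in extract_pairs(SAMPLE, "EN", "FR"):
        print(source, "|||", target)
```

For the full-size TMX files, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading each document into memory at once.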
Provider:
rinto



