Helsinki-NLP/opus_openoffice
Source: Hugging Face. Updated 2024-02-22; indexed 2024-04-20.
Download link:
https://hf-mirror.com/datasets/Helsinki-NLP/opus_openoffice
Resource description:
---
annotations_creators:
- found
language_creators:
- found
language:
- de
- en
- es
- fr
- ja
- ru
- sv
- zh
license:
- unknown
multilinguality:
- multilingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- translation
task_ids: []
pretty_name: OpusOpenoffice
config_names:
- de-en_GB
- de-es
- de-fr
- de-ja
- de-ru
- de-sv
- de-zh_CN
- en_GB-es
- en_GB-fr
- en_GB-ja
- en_GB-ru
- en_GB-sv
- en_GB-zh_CN
- es-fr
- es-ja
- es-ru
- es-sv
- es-zh_CN
- fr-ja
- fr-ru
- fr-sv
- fr-zh_CN
- ja-ru
- ja-sv
- ja-zh_CN
- ru-sv
- ru-zh_CN
- sv-zh_CN
language_bcp47:
- en-GB
- zh-CN
dataset_info:
- config_name: de-en_GB
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - en_GB
  splits:
  - name: train
    num_bytes: 6201077
    num_examples: 77052
  download_size: 2983173
  dataset_size: 6201077
- config_name: de-es
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - es
  splits:
  - name: train
    num_bytes: 6571615
    num_examples: 77000
  download_size: 3145841
  dataset_size: 6571615
- config_name: de-fr
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - fr
  splits:
  - name: train
    num_bytes: 6715805
    num_examples: 76684
  download_size: 3167189
  dataset_size: 6715805
- config_name: de-ja
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - ja
  splits:
  - name: train
    num_bytes: 7084951
    num_examples: 69396
  download_size: 3137719
  dataset_size: 7084951
- config_name: de-ru
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - ru
  splits:
  - name: train
    num_bytes: 8333241
    num_examples: 75511
  download_size: 3585304
  dataset_size: 8333241
- config_name: de-sv
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - sv
  splits:
  - name: train
    num_bytes: 6288962
    num_examples: 77366
  download_size: 3053987
  dataset_size: 6288962
- config_name: de-zh_CN
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - zh_CN
  splits:
  - name: train
    num_bytes: 5836628
    num_examples: 68712
  download_size: 2862703
  dataset_size: 5836628
- config_name: en_GB-es
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en_GB
        - es
  splits:
  - name: train
    num_bytes: 6147581
    num_examples: 77646
  download_size: 2933203
  dataset_size: 6147581
- config_name: en_GB-fr
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en_GB
        - fr
  splits:
  - name: train
    num_bytes: 6297779
    num_examples: 77696
  download_size: 2952170
  dataset_size: 6297779
- config_name: en_GB-ja
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en_GB
        - ja
  splits:
  - name: train
    num_bytes: 6636722
    num_examples: 69149
  download_size: 2920159
  dataset_size: 6636722
- config_name: en_GB-ru
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en_GB
        - ru
  splits:
  - name: train
    num_bytes: 7877970
    num_examples: 75401
  download_size: 3356420
  dataset_size: 7877970
- config_name: en_GB-sv
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en_GB
        - sv
  splits:
  - name: train
    num_bytes: 5861461
    num_examples: 77815
  download_size: 2839624
  dataset_size: 5861461
- config_name: en_GB-zh_CN
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en_GB
        - zh_CN
  splits:
  - name: train
    num_bytes: 5424865
    num_examples: 69400
  download_size: 2663377
  dataset_size: 5424865
- config_name: es-fr
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - es
        - fr
  splits:
  - name: train
    num_bytes: 6663092
    num_examples: 77417
  download_size: 3115129
  dataset_size: 6663092
- config_name: es-ja
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - es
        - ja
  splits:
  - name: train
    num_bytes: 7005123
    num_examples: 68944
  download_size: 3075174
  dataset_size: 7005123
- config_name: es-ru
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - es
        - ru
  splits:
  - name: train
    num_bytes: 8283703
    num_examples: 76461
  download_size: 3533017
  dataset_size: 8283703
- config_name: es-sv
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - es
        - sv
  splits:
  - name: train
    num_bytes: 6232466
    num_examples: 77825
  download_size: 2999454
  dataset_size: 6232466
- config_name: es-zh_CN
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - es
        - zh_CN
  splits:
  - name: train
    num_bytes: 5776827
    num_examples: 68583
  download_size: 2815094
  dataset_size: 5776827
- config_name: fr-ja
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - fr
        - ja
  splits:
  - name: train
    num_bytes: 7160332
    num_examples: 69026
  download_size: 3104825
  dataset_size: 7160332
- config_name: fr-ru
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - fr
        - ru
  splits:
  - name: train
    num_bytes: 8432061
    num_examples: 76464
  download_size: 3553215
  dataset_size: 8432061
- config_name: fr-sv
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - fr
        - sv
  splits:
  - name: train
    num_bytes: 6373350
    num_examples: 77398
  download_size: 3020247
  dataset_size: 6373350
- config_name: fr-zh_CN
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - fr
        - zh_CN
  splits:
  - name: train
    num_bytes: 5918482
    num_examples: 68723
  download_size: 2834942
  dataset_size: 5918482
- config_name: ja-ru
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - ja
        - ru
  splits:
  - name: train
    num_bytes: 8781230
    num_examples: 68589
  download_size: 3534714
  dataset_size: 8781230
- config_name: ja-sv
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - ja
        - sv
  splits:
  - name: train
    num_bytes: 6709627
    num_examples: 69154
  download_size: 2983777
  dataset_size: 6709627
- config_name: ja-zh_CN
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - ja
        - zh_CN
  splits:
  - name: train
    num_bytes: 6397676
    num_examples: 68953
  download_size: 2877818
  dataset_size: 6397676
- config_name: ru-sv
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - ru
        - sv
  splits:
  - name: train
    num_bytes: 7966150
    num_examples: 75560
  download_size: 3425447
  dataset_size: 7966150
- config_name: ru-zh_CN
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - ru
        - zh_CN
  splits:
  - name: train
    num_bytes: 7393659
    num_examples: 66259
  download_size: 3224677
  dataset_size: 7393659
- config_name: sv-zh_CN
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - sv
        - zh_CN
  splits:
  - name: train
    num_bytes: 5492902
    num_examples: 68846
  download_size: 2722340
  dataset_size: 5492902
configs:
- config_name: de-en_GB
  data_files:
  - split: train
    path: de-en_GB/train-*
- config_name: de-es
  data_files:
  - split: train
    path: de-es/train-*
- config_name: de-fr
  data_files:
  - split: train
    path: de-fr/train-*
- config_name: de-ja
  data_files:
  - split: train
    path: de-ja/train-*
- config_name: de-ru
  data_files:
  - split: train
    path: de-ru/train-*
- config_name: de-sv
  data_files:
  - split: train
    path: de-sv/train-*
- config_name: de-zh_CN
  data_files:
  - split: train
    path: de-zh_CN/train-*
- config_name: en_GB-es
  data_files:
  - split: train
    path: en_GB-es/train-*
- config_name: en_GB-fr
  data_files:
  - split: train
    path: en_GB-fr/train-*
- config_name: en_GB-ja
  data_files:
  - split: train
    path: en_GB-ja/train-*
- config_name: en_GB-ru
  data_files:
  - split: train
    path: en_GB-ru/train-*
- config_name: en_GB-sv
  data_files:
  - split: train
    path: en_GB-sv/train-*
- config_name: en_GB-zh_CN
  data_files:
  - split: train
    path: en_GB-zh_CN/train-*
- config_name: es-fr
  data_files:
  - split: train
    path: es-fr/train-*
- config_name: es-ja
  data_files:
  - split: train
    path: es-ja/train-*
- config_name: es-ru
  data_files:
  - split: train
    path: es-ru/train-*
- config_name: es-sv
  data_files:
  - split: train
    path: es-sv/train-*
- config_name: es-zh_CN
  data_files:
  - split: train
    path: es-zh_CN/train-*
- config_name: fr-ja
  data_files:
  - split: train
    path: fr-ja/train-*
- config_name: fr-ru
  data_files:
  - split: train
    path: fr-ru/train-*
- config_name: fr-sv
  data_files:
  - split: train
    path: fr-sv/train-*
- config_name: fr-zh_CN
  data_files:
  - split: train
    path: fr-zh_CN/train-*
- config_name: ja-ru
  data_files:
  - split: train
    path: ja-ru/train-*
- config_name: ja-sv
  data_files:
  - split: train
    path: ja-sv/train-*
- config_name: ja-zh_CN
  data_files:
  - split: train
    path: ja-zh_CN/train-*
- config_name: ru-sv
  data_files:
  - split: train
    path: ru-sv/train-*
- config_name: ru-zh_CN
  data_files:
  - split: train
    path: ru-zh_CN/train-*
- config_name: sv-zh_CN
  data_files:
  - split: train
    path: sv-zh_CN/train-*
---
# Dataset Card for OpusOpenoffice
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://opus.nlpl.eu/OpenOffice/corpus/version/OpenOffice
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
A collection of documents from http://www.openoffice.org/.
The corpus covers 8 languages and 28 bitexts (one per language pair).
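The figure of 28 bitexts is what you get by pairing each of the 8 languages with every other one, since C(8, 2) = 28. A minimal sketch, assuming the language codes are ordered as in the card's `config_names` metadata, that reproduces the configuration names (the variable names are illustrative only):

```python
from itertools import combinations

# Language codes as they appear in the card's configuration names.
LANGUAGES = ["de", "en_GB", "es", "fr", "ja", "ru", "sv", "zh_CN"]

# Each unordered pair of distinct languages corresponds to one bitext.
CONFIG_NAMES = [f"{a}-{b}" for a, b in combinations(LANGUAGES, 2)]

print(len(CONFIG_NAMES))  # 28
print(CONFIG_NAMES[:3])   # ['de-en_GB', 'de-es', 'de-fr']
```

Because `combinations` preserves input order, the generated names come out in the same order as the metadata list, from `de-en_GB` to `sv-zh_CN`.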
### Supported Tasks and Leaderboards
The underlying task is machine translation.
### Languages
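The corpus covers eight languages, identified in the configuration names as German (`de`), British English (`en_GB`), Spanish (`es`), French (`fr`), Japanese (`ja`), Russian (`ru`), Swedish (`sv`), and Chinese (`zh_CN`).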
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
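- `translation`: a dictionary that maps each of the configuration's two language codes (for example `de` and `en_GB`) to the text in that language.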
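According to the card metadata, every configuration exposes a single `translation` feature keyed by its two language codes. The sketch below shows how such a record might be unpacked; the record is hypothetical and its strings are placeholders rather than text from the corpus, and `split_pair` is an illustrative helper, not part of any library.

```python
from typing import Dict, Tuple

# Hypothetical record shaped like a row of the `de-en_GB` configuration.
# The strings are placeholders, not sentences taken from the corpus.
example = {
    "translation": {
        "de": "<German sentence>",
        "en_GB": "<English sentence>",
    }
}


def split_pair(record: Dict[str, Dict[str, str]], src: str, tgt: str) -> Tuple[str, str]:
    """Return the (source, target) texts of one translation record."""
    pair = record["translation"]
    return pair[src], pair[tgt]


source_text, target_text = split_pair(example, "de", "en_GB")
print(source_text, "->", target_text)
```

Swapping the `src` and `tgt` arguments yields the reverse translation direction from the same record.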
### Data Splits
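Each configuration provides a single `train` split. According to the card metadata, split sizes range from 66,259 examples (`ru-zh_CN`) to 77,825 examples (`es-sv`).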
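The `num_examples` values listed in the card's `dataset_info` metadata can be summarised with a few lines of code. This is only a sketch over numbers copied from that metadata; it does not download or inspect the data itself.

```python
# Train-split num_examples per configuration, copied from the card metadata.
TRAIN_EXAMPLES = {
    "de-en_GB": 77052, "de-es": 77000, "de-fr": 76684, "de-ja": 69396,
    "de-ru": 75511, "de-sv": 77366, "de-zh_CN": 68712,
    "en_GB-es": 77646, "en_GB-fr": 77696, "en_GB-ja": 69149,
    "en_GB-ru": 75401, "en_GB-sv": 77815, "en_GB-zh_CN": 69400,
    "es-fr": 77417, "es-ja": 68944, "es-ru": 76461,
    "es-sv": 77825, "es-zh_CN": 68583,
    "fr-ja": 69026, "fr-ru": 76464, "fr-sv": 77398, "fr-zh_CN": 68723,
    "ja-ru": 68589, "ja-sv": 69154, "ja-zh_CN": 68953,
    "ru-sv": 75560, "ru-zh_CN": 66259, "sv-zh_CN": 68846,
}

# Find the configurations with the fewest and the most examples.
smallest = min(TRAIN_EXAMPLES, key=TRAIN_EXAMPLES.get)
largest = max(TRAIN_EXAMPLES, key=TRAIN_EXAMPLES.get)

print(f"{len(TRAIN_EXAMPLES)} configurations")
print(f"smallest: {smallest} ({TRAIN_EXAMPLES[smallest]:,} examples)")
print(f"largest:  {largest} ({TRAIN_EXAMPLES[largest]:,} examples)")
```

Pairs involving `ja` or `zh_CN` sit at the lower end of the range, which is worth keeping in mind when comparing results across configurations.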
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
```
@InProceedings{TIEDEMANN12.463,
  author = {Jörg Tiedemann},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
}
```
### Contributions
Thanks to [@patil-suraj](https://github.com/patil-suraj) for adding this dataset.
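Provider:
Helsinki-NLP
Summary of original information
Dataset overview
Basic dataset information
- Name: OpusOpenoffice
- Languages: multilingual, comprising German (de), English (en), Spanish (es), French (fr), Japanese (ja), Russian (ru), Swedish (sv), and Chinese (zh)
- License: unknown
- Multilinguality: multilingual
- Size: 10K<n<100K
- Source datasets: original
- Task category: translation
Dataset structure
Configuration names and language pairs
- Configuration names: one per language pair, such as de-en_GB, de-es, and de-fr
- Language pairs: each configuration contains translation pairs in two languages; for example, de-en_GB pairs German with British English
Dataset size and splits
- Training set size: varies by language pair; for example, the de-en_GB training set is 6201077 bytes and contains 77052 examples
- Download size: varies by language pair; for example, the de-en_GB download is 2983173 bytes
Dataset creation
- Annotation creators: found
- Language creators: found
Usage considerations
- License: unknown; check copyright and usage restrictions before use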
Additional information
- Citation: Jörg Tiedemann, "Parallel Data, Tools and Interfaces in OPUS", Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, 2012. The full BibTeX entry appears under Citation Information.
Collected summary
Dataset introduction

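Construction
In machine translation research, multilingual parallel corpora underpin progress in cross-lingual understanding. The OpusOpenoffice dataset is drawn from the official documentation of the OpenOffice.org project and was built through a systematic collection and alignment process. Text was extracted from the multilingual documentation of the open-source office suite, covering eight languages (German, English, Spanish, French, Japanese, Russian, Swedish, and Chinese) and yielding 28 parallel corpora that can be used in either translation direction. During construction, the documents went through format parsing and language identification so that source and translation are aligned at the sentence level, which makes the corpus usable as training material for machine translation models.
Characteristics
The dataset is notable for its multilingual coverage in cross-lingual natural language processing tasks. It spans eight widely used languages and provides translation pairs between every two of them, such as German-English and Spanish-Chinese. Each language pair has between roughly 60,000 and 80,000 examples, a moderate and fairly balanced size that suits model training and evaluation. The data is organized in a standard translation format: each record holds a sentence pair in the two languages of the configuration, so it can feed directly into a neural machine translation training pipeline. This structured multilingual design offers material for studying semantic mapping and translation consistency across languages.
Usage
For machine translation research, the configuration for a given language pair can be loaded directly through the Hugging Face datasets library. For example, the 'de-en_GB' configuration provides the German and British English parallel corpus. The dataset contains only a training split, so it usually has to be combined with validation and test sets from other datasets to complete model training and evaluation. Researchers can use the aligned sentence pairs to train translation systems ranging from traditional statistical models to current neural networks, or as benchmark data for multilingual word-embedding learning and cross-lingual transfer learning. The clear structure supports plug-and-play experiment design, which speeds up iteration on translation techniques.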
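The loading step described above might look like the following sketch. It assumes the `datasets` library is installed and that the repository id `Helsinki-NLP/opus_openoffice` from this page is reachable. The text suggests borrowing validation and test sets from other datasets; holding out part of `train` instead is a swapped-in alternative chosen for this sketch, and the function name and default values are illustrative.

```python
def load_pair_with_holdout(config_name: str = "de-en_GB",
                           test_size: float = 0.1,
                           seed: int = 42):
    """Load one language pair and hold out part of `train` for evaluation."""
    # Imported lazily so the function can be defined without the library present.
    from datasets import load_dataset

    train = load_dataset("Helsinki-NLP/opus_openoffice", config_name, split="train")
    # The card ships only a train split; carving out a test portion is an
    # assumption of this sketch, not something the card prescribes.
    return train.train_test_split(test_size=test_size, seed=seed)
```

Calling `load_pair_with_holdout("de-en_GB")` downloads the data on first use and returns a `DatasetDict` with `train` and `test` keys. Fixing `seed` keeps the held-out portion reproducible across runs.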
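Background and challenges
Background overview
In machine translation, building multilingual parallel corpora plays a key role in advancing cross-lingual information processing. The OpusOpenoffice dataset was created around 2012 by the University of Helsinki's natural language processing group (Helsinki-NLP) as part of the OPUS project; its core research problem is the multilingual alignment and translation of open-source office software documentation. It covers eight languages (German, English, Spanish, French, Japanese, Russian, Swedish, and Chinese) and builds 28 bilingual pairs from documents extracted from OpenOffice.org, giving machine translation models domain-specific data and supporting the application and evaluation of language technology in office-automation settings.
Current challenges
OpusOpenoffice targets the domain-adaptation challenges of office-document translation, including consistent technical terminology, complex sentence structure, and accurate rendering of cultural differences across languages. During construction, the main challenges were parsing the original document formats and aligning the languages, which requires semantic equivalence between language versions, as well as handling character encoding and word segmentation for non-Latin scripts such as Japanese and Chinese. In addition, the limited data size and the uneven number of examples across some language pairs may affect how well models generalize in low-resource settings.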
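Common scenarios
Typical use cases
In machine translation, multilingual parallel corpora are the foundation of model training and evaluation. As a collection of multilingual translation pairs from OpenOffice documentation, OpusOpenoffice is typically used as parallel training text for statistical and neural machine translation models. It covers combinations of German, English, Spanish, French, Japanese, Russian, Swedish, and Chinese, and is particularly suited to research on cross-lingual information retrieval and document alignment. With translation pairs drawn from real office documentation, researchers can build translation systems that handle specialized terminology and formal register well, improving cross-lingual access to technical documentation.
Practical applications
In practice, OpusOpenoffice offers a training basis for commercial translation tools and localization services. Demand for translating office software documentation is widespread in multinational companies and international organizations, and the dataset can be used directly to improve automatic translation of office suites, technical manuals, and business documents. For example, a model trained on this data could be integrated into open-source office software to switch the user interface between languages in real time. The dataset also supports building domain-specific translation memories, which improve translation efficiency and terminology consistency and so reduce the cost of cross-lingual collaboration.
Derived and related work
Several lines of research have built on OpusOpenoffice. They focus mainly on improving multilingual neural machine translation architectures, for example by using the dataset to validate zero-shot translation and cross-lingual transfer learning experimentally. Some studies combine it with other subsets of the OPUS corpus to build broader multi-domain translation benchmarks. The dataset is also often used to analyze how the stylistic features of office documentation affect translation quality, which has informed the design of domain-specific translation models. This derived work has enriched machine translation methodology and serves as a reference for deploying multilingual natural language processing technology in practice.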
The content above was collected and summarized by 遇见数据集.



