Helsinki-NLP/opus_books
Source: Hugging Face. Updated: 2024-03-29. Indexed: 2024-04-20.
Download link: https://hf-mirror.com/datasets/Helsinki-NLP/opus_books
Resource description:
---
annotations_creators:
- found
language_creators:
- found
language:
- ca
- de
- el
- en
- eo
- es
- fi
- fr
- hu
- it
- nl
- 'no'
- pl
- pt
- ru
- sv
license:
- other
multilinguality:
- multilingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- translation
task_ids: []
pretty_name: OpusBooks
dataset_info:
- config_name: ca-de
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- ca
- de
splits:
- name: train
num_bytes: 899553
num_examples: 4445
download_size: 609128
dataset_size: 899553
- config_name: ca-en
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- ca
- en
splits:
- name: train
num_bytes: 863162
num_examples: 4605
download_size: 585612
dataset_size: 863162
- config_name: ca-hu
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- ca
- hu
splits:
- name: train
num_bytes: 886150
num_examples: 4463
download_size: 608827
dataset_size: 886150
- config_name: ca-nl
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- ca
- nl
splits:
- name: train
num_bytes: 884811
num_examples: 4329
download_size: 594793
dataset_size: 884811
- config_name: de-en
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- de
- en
splits:
- name: train
num_bytes: 13738975
num_examples: 51467
download_size: 8797832
dataset_size: 13738975
- config_name: de-eo
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- de
- eo
splits:
- name: train
num_bytes: 398873
num_examples: 1363
download_size: 253509
dataset_size: 398873
- config_name: de-es
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- de
- es
splits:
- name: train
num_bytes: 7592451
num_examples: 27526
download_size: 4841017
dataset_size: 7592451
- config_name: de-fr
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- de
- fr
splits:
- name: train
num_bytes: 9544351
num_examples: 34916
download_size: 6164101
dataset_size: 9544351
- config_name: de-hu
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- de
- hu
splits:
- name: train
num_bytes: 13514971
num_examples: 51780
download_size: 8814744
dataset_size: 13514971
- config_name: de-it
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- de
- it
splits:
- name: train
num_bytes: 7759984
num_examples: 27381
download_size: 4901036
dataset_size: 7759984
- config_name: de-nl
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- de
- nl
splits:
- name: train
num_bytes: 3561740
num_examples: 15622
download_size: 2290868
dataset_size: 3561740
- config_name: de-pt
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- de
- pt
splits:
- name: train
num_bytes: 317143
num_examples: 1102
download_size: 197768
dataset_size: 317143
- config_name: de-ru
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- de
- ru
splits:
- name: train
num_bytes: 5764649
num_examples: 17373
download_size: 3255537
dataset_size: 5764649
- config_name: el-en
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- el
- en
splits:
- name: train
num_bytes: 552567
num_examples: 1285
download_size: 310863
dataset_size: 552567
- config_name: el-es
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- el
- es
splits:
- name: train
num_bytes: 527979
num_examples: 1096
download_size: 298827
dataset_size: 527979
- config_name: el-fr
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- el
- fr
splits:
- name: train
num_bytes: 539921
num_examples: 1237
download_size: 303181
dataset_size: 539921
- config_name: el-hu
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- el
- hu
splits:
- name: train
num_bytes: 546278
num_examples: 1090
download_size: 313292
dataset_size: 546278
- config_name: en-eo
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- en
- eo
splits:
- name: train
num_bytes: 386219
num_examples: 1562
download_size: 246715
dataset_size: 386219
- config_name: en-es
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- en
- es
splits:
- name: train
num_bytes: 25291663
num_examples: 93470
download_size: 16080303
dataset_size: 25291663
- config_name: en-fi
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- en
- fi
splits:
- name: train
num_bytes: 715027
num_examples: 3645
download_size: 467851
dataset_size: 715027
- config_name: en-fr
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- en
- fr
splits:
- name: train
num_bytes: 32997043
num_examples: 127085
download_size: 20985324
dataset_size: 32997043
- config_name: en-hu
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- en
- hu
splits:
- name: train
num_bytes: 35256766
num_examples: 137151
download_size: 23065198
dataset_size: 35256766
- config_name: en-it
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- en
- it
splits:
- name: train
num_bytes: 8993755
num_examples: 32332
download_size: 5726189
dataset_size: 8993755
- config_name: en-nl
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- en
- nl
splits:
- name: train
num_bytes: 10277990
num_examples: 38652
download_size: 6443323
dataset_size: 10277990
- config_name: en-no
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- en
- 'no'
splits:
- name: train
num_bytes: 661966
num_examples: 3499
download_size: 429631
dataset_size: 661966
- config_name: en-pl
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- en
- pl
splits:
- name: train
num_bytes: 583079
num_examples: 2831
download_size: 389337
dataset_size: 583079
- config_name: en-pt
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- en
- pt
splits:
- name: train
num_bytes: 309677
num_examples: 1404
download_size: 191493
dataset_size: 309677
- config_name: en-ru
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- en
- ru
splits:
- name: train
num_bytes: 5190856
num_examples: 17496
download_size: 2922360
dataset_size: 5190856
- config_name: en-sv
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- en
- sv
splits:
- name: train
num_bytes: 790773
num_examples: 3095
download_size: 516328
dataset_size: 790773
- config_name: eo-es
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- eo
- es
splits:
- name: train
num_bytes: 409579
num_examples: 1677
download_size: 265543
dataset_size: 409579
- config_name: eo-fr
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- eo
- fr
splits:
- name: train
num_bytes: 412987
num_examples: 1588
download_size: 261689
dataset_size: 412987
- config_name: eo-hu
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- eo
- hu
splits:
- name: train
num_bytes: 389100
num_examples: 1636
download_size: 258229
dataset_size: 389100
- config_name: eo-it
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- eo
- it
splits:
- name: train
num_bytes: 387594
num_examples: 1453
download_size: 248748
dataset_size: 387594
- config_name: eo-pt
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- eo
- pt
splits:
- name: train
num_bytes: 311067
num_examples: 1259
download_size: 197021
dataset_size: 311067
- config_name: es-fi
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- es
- fi
splits:
- name: train
num_bytes: 710450
num_examples: 3344
download_size: 467281
dataset_size: 710450
- config_name: es-fr
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- es
- fr
splits:
- name: train
num_bytes: 14382126
num_examples: 56319
download_size: 9164030
dataset_size: 14382126
- config_name: es-hu
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- es
- hu
splits:
- name: train
num_bytes: 19373967
num_examples: 78800
download_size: 12691292
dataset_size: 19373967
- config_name: es-it
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- es
- it
splits:
- name: train
num_bytes: 7837667
num_examples: 28868
download_size: 5026914
dataset_size: 7837667
- config_name: es-nl
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- es
- nl
splits:
- name: train
num_bytes: 9062341
num_examples: 32247
download_size: 5661890
dataset_size: 9062341
- config_name: es-no
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- es
- 'no'
splits:
- name: train
num_bytes: 729113
num_examples: 3585
download_size: 473525
dataset_size: 729113
- config_name: es-pt
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- es
- pt
splits:
- name: train
num_bytes: 326872
num_examples: 1327
download_size: 204399
dataset_size: 326872
- config_name: es-ru
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- es
- ru
splits:
- name: train
num_bytes: 5281106
num_examples: 16793
download_size: 2995191
dataset_size: 5281106
- config_name: fi-fr
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- fi
- fr
splits:
- name: train
num_bytes: 746085
num_examples: 3537
download_size: 486904
dataset_size: 746085
- config_name: fi-hu
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- fi
- hu
splits:
- name: train
num_bytes: 746602
num_examples: 3504
download_size: 509394
dataset_size: 746602
- config_name: fi-no
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- fi
- 'no'
splits:
- name: train
num_bytes: 691169
num_examples: 3414
download_size: 449501
dataset_size: 691169
- config_name: fi-pl
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- fi
- pl
splits:
- name: train
num_bytes: 613779
num_examples: 2814
download_size: 410258
dataset_size: 613779
- config_name: fr-hu
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- fr
- hu
splits:
- name: train
num_bytes: 22483025
num_examples: 89337
download_size: 14689840
dataset_size: 22483025
- config_name: fr-it
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- fr
- it
splits:
- name: train
num_bytes: 4752147
num_examples: 14692
download_size: 3040617
dataset_size: 4752147
- config_name: fr-nl
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- fr
- nl
splits:
- name: train
num_bytes: 10408088
num_examples: 40017
download_size: 6528881
dataset_size: 10408088
- config_name: fr-no
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- fr
- 'no'
splits:
- name: train
num_bytes: 692774
num_examples: 3449
download_size: 449136
dataset_size: 692774
- config_name: fr-pl
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- fr
- pl
splits:
- name: train
num_bytes: 614236
num_examples: 2825
download_size: 408295
dataset_size: 614236
- config_name: fr-pt
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- fr
- pt
splits:
- name: train
num_bytes: 324604
num_examples: 1263
download_size: 198700
dataset_size: 324604
- config_name: fr-ru
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- fr
- ru
splits:
- name: train
num_bytes: 2474198
num_examples: 8197
download_size: 1425660
dataset_size: 2474198
- config_name: fr-sv
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- fr
- sv
splits:
- name: train
num_bytes: 833541
num_examples: 3002
download_size: 545599
dataset_size: 833541
- config_name: hu-it
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- hu
- it
splits:
- name: train
num_bytes: 8445537
num_examples: 30949
download_size: 5477452
dataset_size: 8445537
- config_name: hu-nl
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- hu
- nl
splits:
- name: train
num_bytes: 10814113
num_examples: 43428
download_size: 6985092
dataset_size: 10814113
- config_name: hu-no
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- hu
- 'no'
splits:
- name: train
num_bytes: 695485
num_examples: 3410
download_size: 465904
dataset_size: 695485
- config_name: hu-pl
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- hu
- pl
splits:
- name: train
num_bytes: 616149
num_examples: 2859
download_size: 425988
dataset_size: 616149
- config_name: hu-pt
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- hu
- pt
splits:
- name: train
num_bytes: 302960
num_examples: 1184
download_size: 193053
dataset_size: 302960
- config_name: hu-ru
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- hu
- ru
splits:
- name: train
num_bytes: 7818652
num_examples: 26127
download_size: 4528613
dataset_size: 7818652
- config_name: it-nl
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- it
- nl
splits:
- name: train
num_bytes: 1328293
num_examples: 2359
download_size: 824780
dataset_size: 1328293
- config_name: it-pt
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- it
- pt
splits:
- name: train
num_bytes: 301416
num_examples: 1163
download_size: 190005
dataset_size: 301416
- config_name: it-ru
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- it
- ru
splits:
- name: train
num_bytes: 5316928
num_examples: 17906
download_size: 2997871
dataset_size: 5316928
- config_name: it-sv
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- it
- sv
splits:
- name: train
num_bytes: 811401
num_examples: 2998
download_size: 527303
dataset_size: 811401
configs:
- config_name: ca-de
data_files:
- split: train
path: ca-de/train-*
- config_name: ca-en
data_files:
- split: train
path: ca-en/train-*
- config_name: ca-hu
data_files:
- split: train
path: ca-hu/train-*
- config_name: ca-nl
data_files:
- split: train
path: ca-nl/train-*
- config_name: de-en
data_files:
- split: train
path: de-en/train-*
- config_name: de-eo
data_files:
- split: train
path: de-eo/train-*
- config_name: de-es
data_files:
- split: train
path: de-es/train-*
- config_name: de-fr
data_files:
- split: train
path: de-fr/train-*
- config_name: de-hu
data_files:
- split: train
path: de-hu/train-*
- config_name: de-it
data_files:
- split: train
path: de-it/train-*
- config_name: de-nl
data_files:
- split: train
path: de-nl/train-*
- config_name: de-pt
data_files:
- split: train
path: de-pt/train-*
- config_name: de-ru
data_files:
- split: train
path: de-ru/train-*
- config_name: el-en
data_files:
- split: train
path: el-en/train-*
- config_name: el-es
data_files:
- split: train
path: el-es/train-*
- config_name: el-fr
data_files:
- split: train
path: el-fr/train-*
- config_name: el-hu
data_files:
- split: train
path: el-hu/train-*
- config_name: en-eo
data_files:
- split: train
path: en-eo/train-*
- config_name: en-es
data_files:
- split: train
path: en-es/train-*
- config_name: en-fi
data_files:
- split: train
path: en-fi/train-*
- config_name: en-fr
data_files:
- split: train
path: en-fr/train-*
- config_name: en-hu
data_files:
- split: train
path: en-hu/train-*
- config_name: en-it
data_files:
- split: train
path: en-it/train-*
- config_name: en-nl
data_files:
- split: train
path: en-nl/train-*
- config_name: en-no
data_files:
- split: train
path: en-no/train-*
- config_name: en-pl
data_files:
- split: train
path: en-pl/train-*
- config_name: en-pt
data_files:
- split: train
path: en-pt/train-*
- config_name: en-ru
data_files:
- split: train
path: en-ru/train-*
- config_name: en-sv
data_files:
- split: train
path: en-sv/train-*
- config_name: eo-es
data_files:
- split: train
path: eo-es/train-*
- config_name: eo-fr
data_files:
- split: train
path: eo-fr/train-*
- config_name: eo-hu
data_files:
- split: train
path: eo-hu/train-*
- config_name: eo-it
data_files:
- split: train
path: eo-it/train-*
- config_name: eo-pt
data_files:
- split: train
path: eo-pt/train-*
- config_name: es-fi
data_files:
- split: train
path: es-fi/train-*
- config_name: es-fr
data_files:
- split: train
path: es-fr/train-*
- config_name: es-hu
data_files:
- split: train
path: es-hu/train-*
- config_name: es-it
data_files:
- split: train
path: es-it/train-*
- config_name: es-nl
data_files:
- split: train
path: es-nl/train-*
- config_name: es-no
data_files:
- split: train
path: es-no/train-*
- config_name: es-pt
data_files:
- split: train
path: es-pt/train-*
- config_name: es-ru
data_files:
- split: train
path: es-ru/train-*
- config_name: fi-fr
data_files:
- split: train
path: fi-fr/train-*
- config_name: fi-hu
data_files:
- split: train
path: fi-hu/train-*
- config_name: fi-no
data_files:
- split: train
path: fi-no/train-*
- config_name: fi-pl
data_files:
- split: train
path: fi-pl/train-*
- config_name: fr-hu
data_files:
- split: train
path: fr-hu/train-*
- config_name: fr-it
data_files:
- split: train
path: fr-it/train-*
- config_name: fr-nl
data_files:
- split: train
path: fr-nl/train-*
- config_name: fr-no
data_files:
- split: train
path: fr-no/train-*
- config_name: fr-pl
data_files:
- split: train
path: fr-pl/train-*
- config_name: fr-pt
data_files:
- split: train
path: fr-pt/train-*
- config_name: fr-ru
data_files:
- split: train
path: fr-ru/train-*
- config_name: fr-sv
data_files:
- split: train
path: fr-sv/train-*
- config_name: hu-it
data_files:
- split: train
path: hu-it/train-*
- config_name: hu-nl
data_files:
- split: train
path: hu-nl/train-*
- config_name: hu-no
data_files:
- split: train
path: hu-no/train-*
- config_name: hu-pl
data_files:
- split: train
path: hu-pl/train-*
- config_name: hu-pt
data_files:
- split: train
path: hu-pt/train-*
- config_name: hu-ru
data_files:
- split: train
path: hu-ru/train-*
- config_name: it-nl
data_files:
- split: train
path: it-nl/train-*
- config_name: it-pt
data_files:
- split: train
path: it-pt/train-*
- config_name: it-ru
data_files:
- split: train
path: it-ru/train-*
- config_name: it-sv
data_files:
- split: train
path: it-sv/train-*
---
# Dataset Card for OPUS Books
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://opus.nlpl.eu/Books/corpus/version/Books
- **Repository:** [More Information Needed]
- **Paper:** https://aclanthology.org/L12-1246/
- **Leaderboard:** [More Information Needed]
- **Point of Contact:** [More Information Needed]
### Dataset Summary
This is a collection of copyright-free books aligned by Andras Farkas, available from http://www.farkastranslations.com/bilingual_books.php.
Note that the texts are rather dated due to copyright issues and that only some of them have been manually reviewed (check the metadata at the top of the corpus files in XML). The original source is multilingually aligned.
In OPUS, the alignment is formally bilingual, but the multilingual alignment can be recovered from the XCES sentence alignment files. Note also that the alignment units from the original source may include multi-sentence paragraphs, which are split and sentence-aligned in OPUS.
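As a rough illustration of how sentence links in an XCES alignment file can be read, the sketch below parses a small hand-written sample. The sample only follows the general shape of OPUS `cesAlign` files; the document names, link ids, and sentence ids are invented for the example and are not taken from the Books corpus, and `parse_links` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the general shape of an OPUS XCES alignment file.
# Document names and sentence ids are illustrative, not taken from the corpus.
SAMPLE_XCES = """<cesAlign version="1.0">
  <linkGrp targType="s" fromDoc="en/book.xml" toDoc="fr/book.xml">
    <link id="SL1" xtargets="s1;s1" />
    <link id="SL2" xtargets="s2 s3;s2" />
    <link id="SL3" xtargets=";s3" />
  </linkGrp>
</cesAlign>"""


def parse_links(xces_text):
    """Return (from_doc, to_doc, source_ids, target_ids) for every <link>."""
    root = ET.fromstring(xces_text)
    links = []
    for group in root.iter("linkGrp"):
        from_doc = group.get("fromDoc")
        to_doc = group.get("toDoc")
        for link in group.iter("link"):
            # xtargets holds "source ids;target ids", each side space-separated.
            source_side, target_side = link.get("xtargets").split(";")
            links.append((from_doc, to_doc, source_side.split(), target_side.split()))
    return links


links = parse_links(SAMPLE_XCES)
```

In this sample the second link joins two source sentences to one target sentence, and the third link has an empty source side, which is how an unaligned target sentence would appear.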
All texts are freely available for personal, educational and research use. Commercial use (e.g. reselling as parallel books) and mass redistribution without explicit permission are not granted. Please acknowledge the source when using the data!
Statistics for the Books corpus:
- Languages: 16
- Bitexts: 64
- Number of files: 158
- Number of tokens: 19.50M
- Sentence fragments: 0.91M
### Supported Tasks and Leaderboards
Translation.
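A minimal sketch of preparing records for a translation model, assuming the record layout declared in the metadata above (an `id` string and a `translation` mapping keyed by language code). The helpers `to_pair` and `load_pairs` are hypothetical names, and the sample sentence is invented. `load_pairs` needs the `datasets` library and network access, so it is defined here but not called.

```python
def to_pair(example, source_lang, target_lang):
    """Turn one record into a (source, target) text pair."""
    translation = example["translation"]
    return translation[source_lang], translation[target_lang]


def load_pairs(config="en-fr", source_lang="en", target_lang="fr"):
    """Load one config and yield text pairs (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily; optional dependency

    dataset = load_dataset("Helsinki-NLP/opus_books", config, split="train")
    for example in dataset:
        yield to_pair(example, source_lang, target_lang)


# Hand-written record in the shape declared by the metadata; the text is invented.
record = {"id": "0", "translation": {"en": "Good morning.", "fr": "Bonjour."}}
pair = to_pair(record, "en", "fr")
```

Because each configuration stores both languages in one mapping, the same records can serve either translation direction by swapping `source_lang` and `target_lang`.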
### Languages
The languages in the dataset are:
- ca
- de
- el
- en
- eo
- es
- fi
- fr
- hu
- it
- nl
- no
- pl
- pt
- ru
- sv
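Not every combination of these 16 languages is available: the metadata lists 64 configurations out of 120 possible pairs. The sketch below, built around a hypothetical helper named `config_for`, looks up the configuration name for a pair. It relies on the observation that every listed name puts the two codes in alphabetical order.

```python
# Configuration names as listed in the dataset metadata (64 language pairs).
CONFIGS = frozenset("""
ca-de ca-en ca-hu ca-nl
de-en de-eo de-es de-fr de-hu de-it de-nl de-pt de-ru
el-en el-es el-fr el-hu
en-eo en-es en-fi en-fr en-hu en-it en-nl en-no en-pl en-pt en-ru en-sv
eo-es eo-fr eo-hu eo-it eo-pt
es-fi es-fr es-hu es-it es-nl es-no es-pt es-ru
fi-fr fi-hu fi-no fi-pl
fr-hu fr-it fr-nl fr-no fr-pl fr-pt fr-ru fr-sv
hu-it hu-nl hu-no hu-pl hu-pt hu-ru
it-nl it-pt it-ru it-sv
""".split())


def config_for(lang_a, lang_b):
    """Return the config name for a language pair, or None if it is not listed.

    Hypothetical helper: config names in the metadata put the two codes in
    alphabetical order, so the pair is sorted before the lookup.
    """
    name = "-".join(sorted((lang_a, lang_b)))
    return name if name in CONFIGS else None
```

For example, `config_for("fr", "en")` resolves to `en-fr`, while a pair such as Polish and Russian has no configuration and yields `None`.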
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
[More Information Needed]
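Although this section is not filled in, the metadata at the top of the card declares two features per configuration: `id` (a string) and `translation` (a mapping with one entry per language in the pair). The following sketch checks a record against that declaration; `check_record` is a hypothetical helper, not part of any library.

```python
def check_record(record, config):
    """Check a record against the features declared for `config` in the metadata.

    Returns a list of problems; an empty list means the record matches.
    """
    problems = []
    expected_langs = set(config.split("-"))
    if not isinstance(record.get("id"), str):
        problems.append("id must be a string")
    translation = record.get("translation")
    if not isinstance(translation, dict):
        problems.append("translation must be a mapping")
    elif set(translation) != expected_langs:
        problems.append(f"translation keys must be {sorted(expected_langs)}")
    return problems


# Invented sample records for the ca-de configuration.
good = {"id": "0", "translation": {"ca": "Bon dia.", "de": "Guten Morgen."}}
bad = {"id": 0, "translation": {"ca": "Bon dia."}}
```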
### Data Splits
[More Information Needed]
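The metadata declares only a `train` split for each configuration, so a held-out set has to be created by the user. One possible approach, sketched here under that assumption, assigns each record to a split by hashing its `id`, which keeps the assignment stable across runs. The function name and the 10% default are illustrative choices.

```python
import hashlib


def assign_split(record_id, validation_percent=10):
    """Deterministically assign a record id to 'train' or 'validation'."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Map the first four bytes of the hash to a bucket in the range 0..99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "validation" if bucket < validation_percent else "train"


ids = [str(i) for i in range(1000)]
splits = {record_id: assign_split(record_id) for record_id in ids}
```

When working with the `datasets` library directly, `Dataset.train_test_split` is an alternative that produces a random split from a seed.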
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
[More Information Needed]
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
[More Information Needed]
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
All texts are freely available for personal, educational and research use. Commercial use (e.g. reselling as parallel books) and mass redistribution without explicit permission are not granted.
### Citation Information
Please acknowledge the source when using the data.
Please cite the following article if you use any part of the OPUS corpus in your own work:
```bibtex
@inproceedings{tiedemann-2012-parallel,
title = "Parallel Data, Tools and Interfaces in {OPUS}",
author = {Tiedemann, J{\"o}rg},
editor = "Calzolari, Nicoletta and
Choukri, Khalid and
Declerck, Thierry and
Do{\u{g}}an, Mehmet U{\u{g}}ur and
Maegaard, Bente and
Mariani, Joseph and
Moreno, Asuncion and
Odijk, Jan and
Piperidis, Stelios",
booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
month = may,
year = "2012",
address = "Istanbul, Turkey",
publisher = "European Language Resources Association (ELRA)",
url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
pages = "2214--2218",
}
```
### Contributions
Thanks to [@abhishekkrthakur](https://github.com/abhishekkrthakur) for adding this dataset.
Provider:
Helsinki-NLP
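Summary of the original information
Dataset overview
Basic information
- Name: OpusBooks
- Languages: ca, de, el, en, eo, es, fi, fr, hu, it, nl, no, pl, pt, ru, sv
- License: other
- Multilinguality: multilingual
- Size category: 1K<n<10K
- Source datasets: original
- Task category: translation
Dataset structure
The dataset contains multiple configurations, each holding translation data for one language pair. Some of them are listed below: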
Every configuration listed here has a string `id` feature, a `translation` feature containing the two languages in its name, and a single `train` split. Sizes are in bytes.
- ca-de: 4445 examples, dataset size 899553, download size 609128
- ca-en: 4605 examples, dataset size 863162, download size 585612
- ca-hu: 4463 examples, dataset size 886150, download size 608827
- ca-nl: 4329 examples, dataset size 884811, download size 594793
- de-en: 51467 examples, dataset size 13738975, download size 8797832
- de-eo: 1363 examples, dataset size 398873, download size 253509
- de-es: 27526 examples, dataset size 7592451, download size 4841017
- de-fr: 34916 examples, dataset size 9544351, download size 6164101
- de-hu: 51780 examples, dataset size 13514971, download size 8814744
- de-it: 27381 examples, dataset size 7759984, download size 4901036
- de-nl: 15622 examples, dataset size 3561740, download size 2290868
- de-pt: 1102 examples, dataset size 317143, download size 197768
- de-ru: 17373 examples, dataset size 5764649, download size 3255537
- el-en: 1285 examples, dataset size 552567, download size 310863
- el-es: 1096 examples, dataset size 527979, download size 298827
- el-fr: 1237 examples, dataset size 539921, download size 303181
- el-hu: 1090 examples, dataset size 546278, download size 313292
- en-eo: 1562 examples, dataset size 386219, download size 246715
- en-es: 93470 examples, dataset size 25291663, download size 16080303
- en-fi: 3645 examples, dataset size 715027, download size 467851
- en-fr: 127085 examples, dataset size 32997043, download size 20985324
- en-hu: 137151 examples, dataset size 35256766, download size 23065198
- en-it: 32332 examples, dataset size 8993755, download size 5726189
- en-nl: 38652 examples, dataset size 10277990, download size 6443323
- en-no: 3499 examples, dataset size 661966, download size 429631
- en-pl: 2831 examples, dataset size 583079, download size 389337
- en-pt: 1404 examples, dataset size 309677, download size 191493
- en-ru: 17496 examples, dataset size 5190856, download size 2922360
- en-sv: 3095 examples, dataset size 790773, download size 516328
- eo-es: 1677 examples, dataset size 409579, download size 265543
- eo-fr: 1588 examples, dataset size 412987, download size 261689
- eo-hu: 1636 examples, dataset size 389100, download size 258229
- eo-it: 1453 examples, dataset size 387594, download size 248748
- eo-pt: 1259 examples, dataset size 311067, download size 197021
- es-fi: 3344 examples, dataset size 710450, download size 467281
- es-fr: 56319 examples, dataset size 14382126, download size 9164030
- es-hu: 78800 examples, dataset size 19373967, download size 12691292
- es-it: 28868 examples, dataset size 7837667, download size 5026914
- es-nl: 32247 examples, dataset size 9062341, download size 5661890
- es-no: 3585 examples, dataset size 729113, download size 473525
- es-pt: 1327 examples, dataset size 326872, download size 204399
- es-ru: 16793 examples, dataset size 5281106, download size 2995191
- fi-fr: 3537 examples, dataset size 746085, download size 486904
- fi-hu: 3504 examples, dataset size 746602, download size 509394
- fi-no: 3414 examples, dataset size 691169, download size 449501
- fi-pl: 2814 examples, dataset size 613779, download size 410258
- fr-hu: 89337 examples, dataset size 22483025, download size 14689840
- fr-it: 14692 examples, dataset size 4752147, download size 3040617
- fr-nl: 40017 examples, dataset size 10408088, download size 6528881
- fr-no: 3449 examples, dataset size 692774, download size 449136
- fr-pl: 2825 examples, dataset size 614236, download size 408295
- fr-pt: …
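Collected summary
Dataset introduction

Construction
The Helsinki-NLP/opus_books dataset consists of translation pairs extracted from aligned books; each pair holds the same text in two languages. It covers 16 languages: Catalan, German, Greek, English, Esperanto, Spanish, Finnish, French, Hungarian, Italian, Dutch, Norwegian, Polish, Portuguese, Russian, and Swedish. Each language-pair configuration provides a single train split.
Characteristics
The dataset's defining characteristic is its multilinguality: it offers translation pairs for 64 language pairs, which makes it suitable for machine translation tasks. Configuration sizes vary widely, from 1090 examples (el-hu) to 137151 examples (en-hu). Each record carries a string identifier (`id`) and the translated text (`translation`).
Usage
When using Helsinki-NLP/opus_books, users select the language pair they need. The data files for each configuration are listed in the metadata under `<config>/train-*`. Typical use involves reading the records, extracting the translation pairs, and feeding them to a machine learning model for training or testing. Users can pre-process or post-process the data as their task requires.
Background and challenges
Background overview
Helsinki-NLP/opus_books is a multilingual translation dataset published by Helsinki-NLP to support research and development in machine translation. It covers translated text for multiple language pairs drawn from Catalan, German, Greek, English, Esperanto, Spanish, Finnish, French, Hungarian, Italian, Dutch, Norwegian, Polish, Portuguese, Russian, and Swedish. Every language pair includes a training set on which researchers can train translation models. The underlying books were aligned by Andras Farkas and are distributed through OPUS; the texts are rather dated, and only some of them have been manually reviewed. The dataset's influence on the field lies mainly in providing varied training data for machine translation research.
Current challenges
Building a dataset such as Helsinki-NLP/opus_books involves several challenges. First, ensuring translation quality and consistency across languages is difficult, because it directly affects whether trained models can convert accurately between languages. Second, scale is a challenge: as many training examples as possible must be provided with limited resources. Third, building a multilingual dataset requires acquiring and processing language resources as well as cleaning and pre-processing the data, all of which take considerable time and computing resources. Finally, availability and ease of use matter: clear documentation and interfaces are needed so that other researchers can access and use the data easily.
Common scenarios
Classic usage scenario
Helsinki-NLP/opus_books is a multilingual translation dataset. Its classic use is as training data for machine translation models, to improve translation accuracy across language pairs.
Academic problems addressed
The dataset supports model training and translation quality evaluation across many language pairs in machine translation, giving researchers a source of parallel translation data for academic research and applications.
Derived work
According to this summary, work building on Helsinki-NLP/opus_books includes cross-lingual information retrieval, machine translation model evaluation, and multilingual natural language processing tasks.
The content above was collected and summarized by 遇见数据集.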



