opus_books
Source: ModelScope community. Updated 2025-12-05; indexed 2025-08-23.
Download link: https://modelscope.cn/datasets/Helsinki-NLP/opus_books
Resource description:
# Dataset Card for OPUS Books
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://opus.nlpl.eu/Books/corpus/version/Books
- **Repository:** [More Information Needed]
- **Paper:** https://aclanthology.org/L12-1246/
- **Leaderboard:** [More Information Needed]
- **Point of Contact:** [More Information Needed]
### Dataset Summary
This is a collection of copyright-free books aligned by Andras Farkas, available from http://www.farkastranslations.com/bilingual_books.php.
Note that the texts are rather dated because of copyright restrictions, and that only some of them have been manually reviewed (check the metadata at the top of the corpus files in XML). The original source is multilingually aligned.
In OPUS, the alignment is formally bilingual but the multilingual alignment can be recovered from the XCES sentence alignment files. Note also that the alignment units from the original source may include multi-sentence paragraphs, which are split and sentence-aligned in OPUS.
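As a rough illustration of how the multilingual alignment might be recovered, the sketch below parses two bilingual XCES alignment snippets and joins them on their shared English side. The XML samples are hypothetical and not taken from the corpus, the `xtargets="source ids;target ids"` layout is assumed here, and `read_links` and `pivot` are illustrative names. Joining only on identical source units is a simplification; real files may need overlapping units merged first.

```python
import xml.etree.ElementTree as ET

# Hypothetical XCES alignment snippets (not taken from the corpus).
# Assumed layout: each <link> holds "source ids;target ids", space-separated.
EN_FR = """<cesAlign version="1.0">
  <linkGrp targType="s" fromDoc="en/book.xml" toDoc="fr/book.xml">
    <link xtargets="s1;s1" />
    <link xtargets="s2 s3;s2" />
    <link xtargets="s4;" />
  </linkGrp>
</cesAlign>"""

EN_IT = """<cesAlign version="1.0">
  <linkGrp targType="s" fromDoc="en/book.xml" toDoc="it/book.xml">
    <link xtargets="s1;s1 s2" />
    <link xtargets="s2 s3;s3" />
  </linkGrp>
</cesAlign>"""


def read_links(xces_text):
    """Return a list of (source_ids, target_ids) tuples from XCES text."""
    root = ET.fromstring(xces_text)
    links = []
    for link in root.iter("link"):
        src, _, tgt = link.get("xtargets", "").partition(";")
        links.append((tuple(src.split()), tuple(tgt.split())))
    return links


def pivot(links_ab, links_ac):
    """Join two bitexts on identical source-side units (the pivot language)."""
    by_source = {src: tgt for src, tgt in links_ac if src}
    return [
        (src, tgt_b, by_source[src])
        for src, tgt_b in links_ab
        if src in by_source and tgt_b
    ]


en_fr = read_links(EN_FR)
en_it = read_links(EN_IT)
trilingual = pivot(en_fr, en_it)
for en_ids, fr_ids, it_ids in trilingual:
    print(en_ids, fr_ids, it_ids)
```

The second link in each sample maps two English sentences to one target sentence, which mirrors the case where a multi-sentence unit from the original source is split and re-aligned. The unaligned `s4` link is dropped by the join.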
All texts are freely available for personal, educational and research use. Commercial use (e.g. reselling as parallel books) and mass redistribution without explicit permission are not granted. Please acknowledge the source when using the data!
Corpus statistics for Books:
- Languages: 16
- Bitexts: 64
- Number of files: 158
- Number of tokens: 19.50M
- Sentence fragments: 0.91M
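To put these figures in perspective, the short calculation below derives two ratios from them. It assumes that each bitext is an unordered language pair and that the token and fragment totals are counted on the same basis; neither assumption is stated in the card.

```python
from math import comb

# Figures copied from the statistics above.
languages = 16
bitexts = 64
tokens = 19.50e6
fragments = 0.91e6

# Assumption: a bitext is an unordered pair of distinct languages.
possible_pairs = comb(languages, 2)
coverage = bitexts / possible_pairs

# Assumption: both totals are counted over the same set of files.
tokens_per_fragment = tokens / fragments

print(f"possible pairs: {possible_pairs}")
print(f"pair coverage: {coverage:.1%}")
print(f"tokens per fragment: {tokens_per_fragment:.1f}")
```

Under these assumptions, roughly half of all possible language pairs are available, so it is worth checking that a given pair exists before relying on it.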
### Supported Tasks and Leaderboards
Translation.
### Languages
The languages in the dataset are:
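- ca (Catalan)
- de (German)
- el (Greek)
- en (English)
- eo (Esperanto)
- es (Spanish)
- fi (Finnish)
- fr (French)
- hu (Hungarian)
- it (Italian)
- nl (Dutch)
- no (Norwegian)
- pl (Polish)
- pt (Portuguese)
- ru (Russian)
- sv (Swedish)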
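A minimal sketch of selecting one language pair is shown below. It assumes the corpus is loaded through the Hugging Face `datasets` library under the repository id `Helsinki-NLP/opus_books`, and that each bitext is exposed as a configuration named with the two codes in alphabetical order (for example `en-fr`). Both points are assumptions rather than statements from this card, and `pair_config` and `load_pair` are illustrative names.

```python
# Language codes listed in this card.
LANGUAGES = {
    "ca", "de", "el", "en", "eo", "es", "fi", "fr",
    "hu", "it", "nl", "no", "pl", "pt", "ru", "sv",
}


def pair_config(lang_a, lang_b):
    """Build a configuration name such as "en-fr".

    Assumption: configurations are named with the codes in alphabetical order.
    """
    for code in (lang_a, lang_b):
        if code not in LANGUAGES:
            raise ValueError(f"unsupported language code: {code!r}")
    if lang_a == lang_b:
        raise ValueError("a bitext needs two different languages")
    return "-".join(sorted((lang_a, lang_b)))


def load_pair(lang_a, lang_b, repo="Helsinki-NLP/opus_books"):
    """Load one bitext with the `datasets` library (requires network access)."""
    from datasets import load_dataset  # third-party; imported lazily

    # Not every pair of the 16 languages is available as a bitext,
    # so this call may fail for a valid pair of codes.
    return load_dataset(repo, pair_config(lang_a, lang_b))


print(pair_config("fr", "en"))
```

Calling `load_pair("en", "fr")` would then download that bitext, provided the pair exists among the available bitexts.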
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
[More Information Needed]
### Data Splits
[More Information Needed]
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
[More Information Needed]
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
[More Information Needed]
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
All texts are freely available for personal, educational and research use. Commercial use (e.g. reselling as parallel books) and mass redistribution without explicit permission are not granted.
### Citation Information
Please acknowledge the source when using the data.
Please cite the following article if you use any part of the OPUS corpus in your own work:
```bibtex
@inproceedings{tiedemann-2012-parallel,
title = "Parallel Data, Tools and Interfaces in {OPUS}",
author = {Tiedemann, J{\"o}rg},
editor = "Calzolari, Nicoletta and
Choukri, Khalid and
Declerck, Thierry and
Do{\u{g}}an, Mehmet U{\u{g}}ur and
Maegaard, Bente and
Mariani, Joseph and
Moreno, Asuncion and
Odijk, Jan and
Piperidis, Stelios",
booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
month = may,
year = "2012",
address = "Istanbul, Turkey",
publisher = "European Language Resources Association (ELRA)",
url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
pages = "2214--2218",
}
```
### Contributions
Thanks to [@abhishekkrthakur](https://github.com/abhishekkrthakur) for adding this dataset.
Provider: maas
Created: 2025-08-16
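Background overview
The OPUS Books dataset is a multilingual parallel corpus of 64 bitexts covering 16 languages (including English, German, and French), totalling 19.50M tokens and 0.91M sentence fragments. The data comes from copyright-free books aligned by Andras Farkas, so the texts are dated. The corpus is mainly used for translation tasks and is limited to personal, educational, and research use; commercial use requires permission.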