softcatala/Europarl-catalan
Source: Hugging Face. Updated 2022-10-24; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/softcatala/Europarl-catalan
---
annotations_creators:
- no-annotation
language_creators:
- machine-generated
language:
- ca
- de
- en
license:
- cc-by-4.0
multilinguality:
- translation
size_categories:
- 1M<n<10M
source_datasets:
- extended|europarl_bilingual
task_categories:
- translation
task_ids: []
pretty_name: Catalan-English and Catalan-German aligned corpora to train NMT systems.
---
# Dataset Card for Europarl-catalan
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://www.softcatala.org/
- **Repository:** https://github.com/Softcatala/Europarl-catalan
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
This dataset contains two language pairs derived from the Europarl corpus: Catalan-English and Catalan-German. The Catalan side is a machine translation, produced with Apertium's rule-based machine translation (RBMT) system from the Spanish side of the Spanish-English Europarl alignment. The Catalan-German alignment was then derived from the de-en and ca-en alignments using this [alignment finder](https://github.com/davidcanovas/alignment-finder-with-pivot-language), with English as the pivot language.
- Catalan-English: 1 965 735 segments.
- Catalan-German: 1 734 644 segments.
### Supported Tasks and Leaderboards
This dataset can be used to train neural machine translation (NMT) and statistical machine translation (SMT) systems.
It has been used as a training corpus for the [Softcatalà machine translation engine](https://www.softcatala.org/traductor/).
### Languages
Catalan (`ca`).
German (`de`).
English (`en`).
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
Raw text.
### Data Splits
One file per language.
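
A minimal sketch of reading one language pair, assuming the per-language files are line-aligned raw text (line N of one file translates line N of the other). The card does not state this explicitly, so it is an assumption. The helper name `read_parallel` and the file names in the usage note are hypothetical, not taken from the repository.

```python
from itertools import zip_longest


def read_parallel(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) sentence pairs from two line-aligned text files.

    Raises ValueError if one file has more lines than the other, since a
    length mismatch would mean the pairs are no longer aligned.
    """
    with open(src_path, encoding=encoding) as src, \
         open(tgt_path, encoding=encoding) as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            # Strip only the trailing newline; keep the raw text otherwise.
            yield s.rstrip("\n"), t.rstrip("\n")
```

With hypothetical file names, usage might look like `for ca, en in read_parallel("corpus.ca", "corpus.en"): ...`; substitute the actual file names from the repository.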
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[@softcatala](https://github.com/Softcatala)
[@jordimas](https://github.com/jordimas)
[@davidcanovas](https://github.com/davidcanovas)
### Licensing Information
[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
### Citation Information
[More Information Needed]
### Contributions
[More Information Needed]
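Provider:
softcatala
Summary of original information
Dataset overview
Dataset name
- Name: Europarl-catalan
- Alias: Catalan-English and Catalan-German aligned corpora to train NMT systems.
Dataset description
- Summary: The dataset contains two language pairs corresponding to the Europarl corpus. Both the English and the German versions are aligned with the Catalan translation; the Catalan-German alignment was obtained from de-en and ca-en using an alignment finder.
- Supported tasks: training NMT and SMT systems.
- Languages: Catalan (ca), German (de), English (en).
Dataset structure
- Data instances: more information needed.
- Data fields: raw text.
- Data splits: one file per language.
Dataset creation
- Source data: extended from the Europarl bilingual dataset.
- License: CC BY 4.0.
Dataset size
- Size category: 1M<n<10M.
Multilinguality
- Multilinguality: translation.
Dataset creators
- Annotation creators: no annotation.
- Language creators: machine-generated.
Considerations for using the data
- Social impact: more information needed.
- Discussion of biases: more information needed.
- Other known limitations: more information needed.
Additional information
- Dataset curators: @softcatala, @jordimas, @davidcanovas.
- Contributions: more information needed.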



