
softcatala/Europarl-catalan

Source: Hugging Face. Updated: 2022-10-24. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/softcatala/Europarl-catalan
Resource description:

---
annotations_creators:
- no-annotation
language_creators:
- machine-generated
language:
- ca
- de
- en
license:
- cc-by-4.0
multilinguality:
- translation
size_categories:
- 1M<n<10M
source_datasets:
- extended|europarl_bilingual
task_categories:
- translation
task_ids: []
pretty_name: Catalan-English and Catalan-German aligned corpora to train NMT systems.
---

# Dataset Card for Tilde-MODEL-Catalan

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://www.softcatala.org/
- **Repository:** https://github.com/Softcatala/Europarl-catalan
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This dataset contains two language pairs derived from the Europarl corpus. Both the English and the German versions are aligned with a Catalan translation, which was obtained by running Apertium's RBMT system on the Spanish side of the Spanish-English alignment.

The Catalan-German alignment was obtained from de-en and ca-en using this [alignment finder](https://github.com/davidcanovas/alignment-finder-with-pivot-language).

- Catalan-English: 1,965,735 segments.
- Catalan-German: 1,734,644 segments.

### Supported Tasks and Leaderboards

This dataset can be used to train NMT and SMT systems. It has been used as a training corpus for the [Softcatalà machine translation engine](https://www.softcatala.org/traductor/).

### Languages

Catalan (`ca`). German (`de`). English (`en`).

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

Raw text.

### Data Splits

One file per language.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[@softcatala](https://github.com/Softcatala)
[@jordimas](https://github.com/jordimas)
[@davidcanovas](https://github.com/davidcanovas)

### Licensing Information

[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
Provider:
softcatala
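Summary of original information

Dataset overview

Dataset name

  • Name: Tilde-MODEL-Catalan
  • Alias: Catalan-English and Catalan-German aligned corpora to train NMT systems.

Dataset description

  • Summary: This dataset contains two language pairs corresponding to the Europarl corpus. Both the English and the German versions are aligned with the Catalan translation; the Catalan-German alignment was obtained from de-en and ca-en using an alignment finder.
  • Supported tasks: Training NMT and SMT systems.
  • Languages: Catalan (ca), German (de), English (en).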
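
The pivot idea behind the Catalan-German pair can be illustrated with a minimal sketch. It assumes that two segments are linked whenever they share exactly the same English segment; the function name `pivot_align` and the sample sentences are hypothetical, and the actual alignment finder may use different matching rules.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def pivot_align(
    ca_en: Iterable[Tuple[str, str]],
    de_en: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Build (ca, de) pairs by joining on an identical English pivot segment.

    Assumption: exact string match on the stripped English side.
    """
    # Index every German segment by its English counterpart.
    de_by_en = defaultdict(list)
    for de, en in de_en:
        de_by_en[en.strip()].append(de)

    # Emit a Catalan-German pair for each shared English segment.
    pairs = []
    for ca, en in ca_en:
        for de in de_by_en.get(en.strip(), []):
            pairs.append((ca, de))
    return pairs


# Hypothetical sample data, not taken from the corpus.
sample_ca_en = [
    ("Bon dia.", "Good morning."),
    ("Gràcies.", "Thank you."),
    ("Sessió oberta.", "Session opened."),
]
sample_de_en = [
    ("Guten Morgen.", "Good morning."),
    ("Danke.", "Thank you."),
]
sample_ca_de = pivot_align(sample_ca_en, sample_de_en)
```

In this sketch, Catalan segments whose English side has no German counterpart are dropped, so the pivoted pair can end up smaller than either input pair.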

Dataset structure

  • Data instances: To be added.
  • Data fields: Raw text.
  • Data splits: One file per language.
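
With one raw-text file per language, a language pair is typically consumed by reading two files in parallel. The sketch below assumes that line i of one file corresponds to line i of the other, which the dataset card does not state explicitly; the function name `read_parallel` and the file names in the usage comment are hypothetical.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_parallel(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) segment pairs from two line-aligned text files.

    Assumption: both files are UTF-8 and have one segment per line.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            # zip_longest pads the shorter file with None, which exposes
            # a length mismatch instead of silently truncating.
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")


# Usage with hypothetical file names:
# for ca, en in read_parallel("corpus.ca", "corpus.en"):
#     print(ca, "|||", en)
```

Raising on a length mismatch is a deliberate choice here: a silent truncation would hide misaligned files, which matters when the pairs are used for training.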

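Dataset creation

  • Source data: Extended from the Europarl bilingual dataset.
  • License: CC BY 4.0

Dataset size

  • Size category: 1M<n<10M.

Multilinguality

  • Multilinguality: Translation.

Dataset creators

  • Annotation creators: No annotation.
  • Language creators: Machine-generated.

Considerations for using the data

  • Social impact: To be added.
  • Discussion of biases: To be added.
  • Other known limitations: To be added.

Additional information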
