softcatala/Tilde-MODEL-Catalan
Source: Hugging Face. Updated 2024-08-14; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/softcatala/Tilde-MODEL-Catalan
Resource description:
---
annotations_creators: []
language_creators:
- machine-generated
language:
- ca
- de
license:
- cc-by-4.0
multilinguality:
- translation
size_categories:
- 1M<n<10M
source_datasets:
- extended|tilde_model
task_categories:
- text2text-generation
- translation
task_ids: []
pretty_name: Catalan-German aligned corpora to train NMT systems.
tags:
- conditional-text-generation
---
# Dataset Card for Tilde-MODEL-Catalan
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://www.softcatala.org/
- **Repository:** https://github.com/Softcatala/Tilde-MODEL-catalan
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
This dataset contains the German version of the Tilde-MODEL corpus aligned with a Catalan translation.
The Catalan text was obtained by translating the Spanish version of the corpus with Apertium's rule-based machine translation (RBMT) system. The dataset contains 3.4M segments.
### Supported Tasks and Leaderboards
This dataset can be used to train neural (NMT) and statistical (SMT) machine translation systems.
It has been used as a training corpus for the [Softcatalà machine translation engine](https://www.softcatala.org/traductor/).
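When training a translation system on this corpus, a held-out validation set is usually needed, and the card does not describe any predefined train/validation split. The following is a minimal sketch of a deterministic split over (German, Catalan) segment pairs. The function name `split_pairs`, the `dev_size` parameter, and the default seed are illustrative assumptions, not part of the dataset or of the Softcatalà training pipeline.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (German segment, Catalan segment)


def split_pairs(
    pairs: Sequence[Pair], dev_size: int, seed: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """Split segment pairs into (train, dev), holding out `dev_size` random pairs.

    Hypothetical helper: the same seed always selects the same held-out pairs,
    and the original order is preserved within each part.
    """
    if not 0 <= dev_size <= len(pairs):
        raise ValueError("dev_size must be between 0 and len(pairs)")
    indices = list(range(len(pairs)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    dev_indices = set(indices[:dev_size])
    train = [pair for i, pair in enumerate(pairs) if i not in dev_indices]
    dev = [pair for i, pair in enumerate(pairs) if i in dev_indices]
    return train, dev
```

Because the Catalan side is machine-generated, it may be worth inspecting the held-out pairs by hand before relying on them for evaluation.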
### Languages
- Catalan (`ca`)
- German (`de`)
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
Raw text.
### Data Splits
One file per language.
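With one raw-text file per language, segment pairs can be rebuilt by reading both files in parallel. The sketch below assumes that the two files are line-aligned (line *n* of the German file corresponds to line *n* of the Catalan file) and UTF-8 encoded; the card does not state either explicitly. The function name `read_parallel` and the file paths passed to it are hypothetical.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_parallel(de_path: str, ca_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (German, Catalan) segment pairs from two line-aligned text files.

    Assumption: both files are UTF-8 and contain one segment per line.
    Raises ValueError if one file ends before the other.
    """
    with open(de_path, encoding="utf-8") as de_file, \
            open(ca_path, encoding="utf-8") as ca_file:
        lines = zip_longest(de_file, ca_file)  # pads the shorter file with None
        for line_no, (de, ca) in enumerate(lines, start=1):
            if de is None or ca is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield de.rstrip("\n"), ca.rstrip("\n")
```

Reading lazily keeps memory use flat, which matters for a corpus of several million segments; the length check guards against silently misaligned pairs.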
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[@softcatala](https://github.com/Softcatala)
[@jordimas](https://github.com/jordimas)
[@davidcanovas](https://github.com/davidcanovas)
### Licensing Information
[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
### Citation Information
[More Information Needed]
### Contributions
[More Information Needed]
Provider:
softcatala



