oliverkinch/machine-translation-da-ar
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/oliverkinch/machine-translation-da-ar
---
language:
- da
- ar
license: cc0-1.0
task_categories:
- translation
tags:
- danish
- arabic
- opus
- parallel-corpus
- translation
pretty_name: Danish-Arabic Translation Dataset
size_categories:
- 1M<n<10M
source_datasets:
- opus
---
# Danish-Arabic Machine Translation Dataset
A Danish–Arabic parallel corpus assembled from multiple OPUS sources and filtered using heuristic quality rules.
## Dataset Summary
| Attribute | Value |
| ------------------------------- | ---------------------------- |
| Columns | `danish`, `arabic`, `source` |
| Language pair | Danish → Arabic |
| Sentence length range (Danish) | 4 – 150 words |
## Load the dataset
```python
from datasets import load_dataset

# Load the training split from the Hugging Face Hub (cached after the first call).
ds = load_dataset("oliverkinch/machine-translation-da-ar", split="train")
print(ds[0])
# {'danish': '...', 'arabic': '...', 'source': 'NLLB'}
```
## Data Fields
| Field | Type | Description |
| --------- | ------ | -------------------------- |
| `danish` | string | Source sentence in Danish |
| `arabic` | string | Target sentence in Arabic |
| `source` | string | Originating corpus name |
## Source Corpora
The dataset aggregates sentence pairs from the following OPUS corpora:
| Source | Domain |
| ---------------- | ------------ |
| WikiMatrix | Wikipedia |
| TED2020 | TED Talks |
| ELRC-wikipedia_health | Health |
| wikimedia | Wikipedia |
| NLLB | Web |
WikiMatrix and wikimedia align Wikipedia content across languages. TED2020 provides parallel transcripts of TED talks. ELRC-wikipedia_health covers health and COVID-19 related Wikipedia content. NLLB (No Language Left Behind) is a multilingual corpus released by Meta AI.
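The `source` column makes it possible to group or subset pairs by domain. The sketch below is a minimal, pure-Python illustration: the `SOURCE_DOMAIN` mapping is copied from the table above, while the helper names `count_by_domain` and `select_domain` are hypothetical. It assumes that the values stored in `source` match the corpus names in the table exactly, which is worth checking against the actual data.

```python
from collections import Counter
from typing import Iterable, Iterator, Mapping

# Source-to-domain mapping copied from the Source Corpora table.
# Assumption: the `source` column uses exactly these spellings.
SOURCE_DOMAIN = {
    "WikiMatrix": "Wikipedia",
    "TED2020": "TED Talks",
    "ELRC-wikipedia_health": "Health",
    "wikimedia": "Wikipedia",
    "NLLB": "Web",
}


def count_by_domain(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count sentence pairs per domain; unmapped sources count as 'unknown'."""
    return Counter(SOURCE_DOMAIN.get(row["source"], "unknown") for row in rows)


def select_domain(rows: Iterable[Mapping[str, str]], domain: str) -> Iterator[Mapping[str, str]]:
    """Yield only the rows whose source corpus belongs to the given domain."""
    return (row for row in rows if SOURCE_DOMAIN.get(row["source"]) == domain)


# Placeholder rows with the dataset's three columns (text contents elided).
toy_rows = [
    {"danish": "...", "arabic": "...", "source": "NLLB"},
    {"danish": "...", "arabic": "...", "source": "NLLB"},
    {"danish": "...", "arabic": "...", "source": "TED2020"},
    {"danish": "...", "arabic": "...", "source": "wikimedia"},
    {"danish": "...", "arabic": "...", "source": "WikiMatrix"},
]

domain_counts = count_by_domain(toy_rows)
wikipedia_rows = list(select_domain(toy_rows, "Wikipedia"))
print(domain_counts.most_common())
```

Both helpers accept any iterable of row dictionaries, so they should also work when iterating over the loaded `ds` object directly.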
## Filtering Pipeline
Pairs were retained only if they passed all of the following heuristic filters:
| Filter | Criterion |
| ----------------------------- | --------- |
| Minimum words (Danish) | ≥ 4 |
| Minimum words (Arabic) | ≥ 3 |
| Maximum words (either side) | ≤ 150 |
| DA/AR word-count ratio | 0.4 – 2.5 |
| URLs in either field | Rejected |
| Non-Latin script in Danish | Rejected |
| Non-Arabic script in Arabic | Rejected |
| Exact deduplication (Danish) | Enabled |
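The original filtering code is not part of this card, so the following is only a rough sketch of how the rules in the table could be implemented. Several details are assumptions rather than documented behaviour: words are counted by whitespace splitting, URLs are detected with a simple `http(s)://` or `www.` pattern, the script checks use Unicode character names for alphabetic characters only, and deduplication keeps the first passing pair for each exact Danish string. The names `passes_filters` and `filter_pairs` are hypothetical.

```python
import re
import unicodedata
from typing import Iterable, Iterator, Mapping

# Assumption: a URL is anything containing "http://", "https://" or "www.".
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)


def _letters_in_script(text: str, script_prefix: str) -> bool:
    """True if every alphabetic character has a Unicode name starting with script_prefix."""
    return all(
        unicodedata.name(ch, "").startswith(script_prefix)
        for ch in text
        if ch.isalpha()  # digits, punctuation and combining marks are ignored
    )


def passes_filters(danish: str, arabic: str) -> bool:
    """Apply the length, ratio, URL and script rules from the table to one pair."""
    da_words, ar_words = len(danish.split()), len(arabic.split())
    if da_words < 4 or ar_words < 3:
        return False
    if da_words > 150 or ar_words > 150:
        return False
    if not 0.4 <= da_words / ar_words <= 2.5:
        return False
    if URL_RE.search(danish) or URL_RE.search(arabic):
        return False
    return _letters_in_script(danish, "LATIN") and _letters_in_script(arabic, "ARABIC")


def filter_pairs(rows: Iterable[Mapping[str, str]]) -> Iterator[Mapping[str, str]]:
    """Yield rows that pass all filters, keeping one row per exact Danish sentence."""
    seen_danish = set()
    for row in rows:
        if row["danish"] in seen_danish:
            continue
        if passes_filters(row["danish"], row["arabic"]):
            seen_danish.add(row["danish"])
            yield row


# Made-up rows for illustration only.
toy_rows = [
    {"danish": "Jeg kan godt lide kaffe", "arabic": "أنا أحب القهوة كثيرا", "source": "demo"},
    {"danish": "Jeg kan godt lide kaffe", "arabic": "أنا أحب القهوة كثيرا", "source": "demo"},  # duplicate
    {"danish": "Hej med dig", "arabic": "مرحبا بك يا صديقي", "source": "demo"},  # Danish too short
    {"danish": "Se mere på https://example.com i dag", "arabic": "شاهد المزيد على الموقع اليوم", "source": "demo"},  # URL
]
kept = list(filter_pairs(toy_rows))
print(len(kept))
```

Because the released dataset is already filtered, a check like this is mainly useful for applying comparable rules to additional data before merging it with the corpus.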
## Intended Uses
* Training and fine-tuning machine translation (MT) models for the Danish–Arabic direction
* Data augmentation for low-resource Danish and Arabic NLP tasks
* Benchmarking cross-lingual representations
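For MT training, many translation pipelines in the Hugging Face ecosystem expect each example as a nested `translation` dictionary keyed by language code. That layout is an assumption about the downstream tooling, not a requirement of this dataset, and the helper name `to_translation_example` is hypothetical. A minimal conversion sketch:

```python
from typing import Dict, Mapping


def to_translation_example(row: Mapping[str, str]) -> Dict[str, Dict[str, str]]:
    """Re-key one row by ISO 639-1 language code, dropping the `source` column."""
    return {"translation": {"da": row["danish"], "ar": row["arabic"]}}


# Made-up row with the dataset's three columns.
row = {"danish": "Jeg kan godt lide kaffe", "arabic": "أنا أحب القهوة كثيرا", "source": "TED2020"}
example = to_translation_example(row)
print(example["translation"]["da"])
```

The function takes a single row and returns a new dictionary, so it can be applied row by row without modifying the original data.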
## Licensing
The card metadata declares `cc0-1.0`, but the individual source corpora carry their own licenses (primarily CC-BY-4.0 and CC-BY-SA-4.0). Consult the [OPUS platform](https://opus.nlpl.eu) for the license of each sub-corpus before any commercial use.
## Citation
If you use this dataset, please cite the relevant OPUS sub-corpora. The OPUS platform is described in:
```bibtex
@inproceedings{tiedemann-2012-parallel,
title = {Parallel Data, Tools and Interfaces in {OPUS}},
author = {Tiedemann, J{\"o}rg},
booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)},
year = {2012},
pages = {2214--2218}
}
```
Provided by:
oliverkinch



