oliverkinch/machine-translation-da-en
Source: Hugging Face. Updated 2026-04-17; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/oliverkinch/machine-translation-da-en
Resource description:
---
language:
- da
- en
license: cc0-1.0
task_categories:
- translation
tags:
- danish
- english
- opus
- parallel-corpus
- translation
pretty_name: Danish-English Translation Dataset
size_categories:
- 10M<n<100M
source_datasets:
- opus
---
# Danish-English Machine Translation Dataset
A high-quality Danish–English parallel corpus with **13,047,991 sentence pairs**, assembled from multiple OPUS sources and filtered using heuristic quality rules.
## Dataset Summary
| Attribute | Value |
| ------------------------------ | ----------------------------- |
| Pairs | 13,047,991 |
| Columns | `danish`, `english`, `source` |
| Language pair | Danish → English |
| Avg. Danish sentence length | ~22 words |
| Avg. English sentence length | ~24 words |
| Sentence length range (Danish) | 4 – 150 words |
## Load the dataset
```python
from datasets import load_dataset
ds = load_dataset("oliverkinch/machine-translation-da-en", split="train")
print(ds[0])
# {'danish': '...', 'english': '...', 'source': 'DGT'}
```
## Data Fields
| Field | Type | Description |
| --------- | ------ | -------------------------- |
| `danish` | string | Source sentence in Danish |
| `english` | string | Target sentence in English |
| `source` | string | Originating corpus name |
## Source Corpora
The dataset aggregates sentence pairs from the following OPUS corpora:
| Source | Pairs | Share |
| ---------------------- | --------- | ----- |
| ELRC-4248-NTEU_TierA | 9,583,853 | 73.5% |
| DGT | 1,517,756 | 11.6% |
| ELRC-EMEA | 651,967 | 5.0% |
| WikiMatrix | 413,355 | 3.2% |
| Europarl | 217,004 | 1.7% |
| LinguaTools-WikiTitles | 206,836 | 1.6% |
| ELITR-ECA | 106,988 | 0.8% |
| ECB | 99,018 | 0.8% |
| KDE4 | 78,630 | 0.6% |
| wikimedia | 56,638 | 0.4% |
| Other (40 corpora) | ~115,946 | 0.9% |
The dominant source (ELRC-4248-NTEU_TierA) and DGT are EU professional translation memories, consisting of human translations by certified EU translators for legally binding documents. Europarl contains proceedings of the European Parliament. WikiMatrix and wikimedia derive from Wikipedia content aligned across languages.
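The shares in the table can be recomputed from the pair counts. The sketch below hard-codes the counts from the table above and derives each percentage relative to the total; the names `SOURCE_COUNTS` and `source_shares` are illustrative and not part of any dataset tooling.

```python
# Pair counts copied from the Source Corpora table.
SOURCE_COUNTS = {
    "ELRC-4248-NTEU_TierA": 9_583_853,
    "DGT": 1_517_756,
    "ELRC-EMEA": 651_967,
    "WikiMatrix": 413_355,
    "Europarl": 217_004,
    "LinguaTools-WikiTitles": 206_836,
    "ELITR-ECA": 106_988,
    "ECB": 99_018,
    "KDE4": 78_630,
    "wikimedia": 56_638,
    "Other (40 corpora)": 115_946,
}


def source_shares(counts):
    """Return each source's share of the total in percent, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_pairs = sum(SOURCE_COUNTS.values())
shares = source_shares(SOURCE_COUNTS)
```

Summed this way, the per-source counts add up to the 13,047,991 pairs stated in the summary, and the two EU translation memories together account for roughly 85% of the data.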
## Filtering Pipeline
Pairs were retained only if they passed all of the following heuristic filters:
| Filter | Criterion |
| -------------------------- | --------- |
| Minimum words (Danish) | ≥ 4 |
| Maximum words (Danish) | ≤ 150 |
| DA/EN word-count ratio | 0.4 – 2.5 |
| Identical DA == EN | Rejected |
| Digit fraction in Danish | ≤ 30% |
| URLs in either field | Rejected |
| Non-Latin script in Danish | Rejected |
| Exact deduplication | Enabled |
From an input of ~29M raw pairs across all sources, approximately **35% passed these filters**.
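The filters in the table can be sketched as a single predicate. This is a minimal illustration, not the pipeline actually used to build the dataset. Several details are assumptions: whitespace tokenisation for word counts, the ratio computed as Danish words divided by English words, the digit fraction measured over characters, the URL pattern, and "non-Latin" meaning any alphabetic character whose Unicode name does not contain `LATIN`. The function names are hypothetical.

```python
import re
import unicodedata

# Assumed URL pattern; the original pipeline's pattern is not documented.
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)


def has_non_latin_letter(text):
    """True if any alphabetic character is outside the Latin script (assumed definition)."""
    for ch in text:
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return True
    return False


def keep_pair(danish, english):
    """Apply the heuristic filters from the table to one sentence pair."""
    da_words = danish.split()
    en_words = english.split()

    if not 4 <= len(da_words) <= 150:        # minimum / maximum Danish words
        return False
    if not en_words:                          # guard against division by zero
        return False
    ratio = len(da_words) / len(en_words)     # DA/EN word-count ratio
    if not 0.4 <= ratio <= 2.5:
        return False
    if danish.strip() == english.strip():     # identical DA == EN
        return False
    digits = sum(ch.isdigit() for ch in danish)
    if digits / len(danish) > 0.30:           # digit fraction in Danish
        return False
    if URL_RE.search(danish) or URL_RE.search(english):
        return False
    if has_non_latin_letter(danish):          # non-Latin script in Danish
        return False
    return True


def filter_pairs(pairs):
    """Yield pairs that pass keep_pair, dropping exact duplicates."""
    seen = set()
    for danish, english in pairs:
        key = (danish, english)
        if key in seen or not keep_pair(danish, english):
            continue
        seen.add(key)
        yield danish, english
```

Keeping every seen pair in an in-memory set is only practical for small samples; at the scale of tens of millions of pairs, hashing each pair or deduplicating on disk would likely be needed.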
## Intended Uses
* Training and fine-tuning machine translation (MT) models for the Danish–English direction
* Data augmentation for low-resource Danish NLP tasks
* Benchmarking cross-lingual representations
* Building Danish language resources
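For the MT fine-tuning use case, many Hugging Face translation examples store each row as a nested `translation` dictionary keyed by language code. The sketch below converts a row with this dataset's `danish`, `english`, and `source` columns into that shape. The function name, the toy sentence, and the choice to keep `source` are illustrative assumptions.

```python
def to_translation_example(row):
    """Convert a {'danish', 'english', 'source'} row to a nested translation record."""
    return {
        "translation": {"da": row["danish"], "en": row["english"]},
        "source": row["source"],
    }


# Made-up row for illustration; real rows come from the dataset.
row = {"danish": "Hej verden.", "english": "Hello world.", "source": "DGT"}
example = to_translation_example(row)
```

With the `datasets` library, a function like this could be applied to the whole split via `ds.map(to_translation_example, remove_columns=["danish", "english"])`.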
## Licensing
Individual source corpora carry their own licenses (primarily CC0, CC-BY, or open government licenses). Please consult the OPUS platform for the license of each sub-corpus before use in commercial settings.
## Citation
If you use this dataset, please cite the relevant OPUS sub-corpora. The OPUS platform is described in:
```bibtex
@inproceedings{tiedemann-2012-parallel,
title = {Parallel Data, Tools and Interfaces in {OPUS}},
author = {Tiedemann, J{\"o}rg},
booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)},
year = {2012},
pages = {2214--2218}
}
```
Provider: oliverkinch



