
oliverkinch/machine-translation-da-en

Source: Hugging Face. Updated 2026-04-17. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/oliverkinch/machine-translation-da-en
Resource description:
---
language:
- da
- en
license: cc0-1.0
task_categories:
- translation
tags:
- danish
- english
- opus
- parallel-corpus
- translation
pretty_name: Danish-English Translation Dataset
size_categories:
- 10M<n<100M
source_datasets:
- opus
---

# Danish-English Machine Translation Dataset

A high-quality Danish–English parallel corpus with **13,047,991 sentence pairs**, assembled from multiple OPUS sources and filtered using heuristic quality rules.

## Dataset Summary

| Attribute                      | Value                         |
| ------------------------------ | ----------------------------- |
| Pairs                          | 13,047,991                    |
| Columns                        | `danish`, `english`, `source` |
| Language pair                  | Danish → English              |
| Avg. Danish sentence length    | ~22 words                     |
| Avg. English sentence length   | ~24 words                     |
| Sentence length range (Danish) | 4 – 150 words                 |

## Load the dataset

```python
from datasets import load_dataset

# Download (or reuse the cached copy of) the single "train" split.
ds = load_dataset("oliverkinch/machine-translation-da-en", split="train")

# Each row is a dict with the three columns described under "Data Fields".
print(ds[0])
# {'danish': '...', 'english': '...', 'source': 'DGT'}
```

## Data Fields

| Field     | Type   | Description                |
| --------- | ------ | -------------------------- |
| `danish`  | string | Source sentence in Danish  |
| `english` | string | Target sentence in English |
| `source`  | string | Originating corpus name    |

## Source Corpora

The dataset aggregates sentence pairs from the following OPUS corpora:

| Source                 | Pairs     | Share |
| ---------------------- | --------- | ----- |
| ELRC-4248-NTEU_TierA   | 9,583,853 | 73.5% |
| DGT                    | 1,517,756 | 11.6% |
| ELRC-EMEA              | 651,967   | 5.0%  |
| WikiMatrix             | 413,355   | 3.2%  |
| Europarl               | 217,004   | 1.7%  |
| LinguaTools-WikiTitles | 206,836   | 1.6%  |
| ELITR-ECA              | 106,988   | 0.8%  |
| ECB                    | 99,018    | 0.8%  |
| KDE4                   | 78,630    | 0.6%  |
| wikimedia              | 56,638    | 0.4%  |
| Other (40 corpora)     | ~115,946  | 0.9%  |

The dominant source (ELRC-4248-NTEU_TierA) and DGT are EU professional translation memories, consisting of human translations by certified EU translators for legally binding documents.
Europarl contains proceedings of the European Parliament. WikiMatrix and wikimedia derive from Wikipedia content aligned across languages.

## Filtering Pipeline

Pairs were retained only if they passed all of the following heuristic filters:

| Filter                     | Criterion |
| -------------------------- | --------- |
| Minimum words (Danish)     | ≥ 4       |
| Maximum words (Danish)     | ≤ 150     |
| DA/EN word-count ratio     | 0.4 – 2.5 |
| Identical DA == EN         | Rejected  |
| Digit fraction in Danish   | ≤ 30%     |
| URLs in either field       | Rejected  |
| Non-Latin script in Danish | Rejected  |
| Exact deduplication        | Enabled   |

Approximately **35% of raw pairs passed these filters** from an input of ~29M pairs across all sources.

## Intended Uses

* Training and fine-tuning machine translation (MT) models for the Danish–English direction
* Data augmentation for low-resource Danish NLP tasks
* Benchmarking cross-lingual representations
* Building Danish language resources

## Licensing

Individual source corpora carry their own licenses (primarily CC0, CC-BY, or open government licenses). Please consult the OPUS platform for the license of each sub-corpus before use in commercial settings.

## Citation

If you use this dataset, please cite the relevant OPUS sub-corpora. The OPUS platform is described in:

```bibtex
@inproceedings{tiedemann-2012-parallel,
  title     = {Parallel Data, Tools and Interfaces in {OPUS}},
  author    = {Tiedemann, J{\"o}rg},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)},
  year      = {2012},
  pages     = {2214--2218}
}
```
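The heuristic rules listed under "Filtering Pipeline" can be illustrated with a minimal sketch. This is not the author's pipeline: the card does not say how words are counted, how the digit fraction is measured, or how URLs and scripts are detected. The sketch assumes whitespace tokenization, a digit fraction taken over non-whitespace characters, a simple `http(s)://` or `www.` pattern for URLs, and Unicode character names for the Latin-script check. The names `keep_pair`, `filter_pairs`, `has_non_latin_letters`, and `URL_RE` are hypothetical.

```python
import re
import unicodedata

# Assumed URL pattern; the card does not specify how URLs were detected.
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)


def has_non_latin_letters(text):
    """Return True if any alphabetic character is outside the Latin script."""
    for ch in text:
        # Unicode names of Latin letters (including æ, ø, å) start with "LATIN".
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return True
    return False


def keep_pair(danish, english):
    """Apply the per-pair rules from the table (deduplication is separate)."""
    da_words = danish.split()  # assumption: whitespace tokenization
    en_words = english.split()

    # Danish length must be between 4 and 150 words.
    if not 4 <= len(da_words) <= 150:
        return False
    # An empty English side makes the ratio undefined, so reject it.
    if not en_words:
        return False
    # DA/EN word-count ratio must lie in [0.4, 2.5].
    if not 0.4 <= len(da_words) / len(en_words) <= 2.5:
        return False
    # Untranslated pairs (identical on both sides) are rejected.
    if danish.strip() == english.strip():
        return False
    # Assumption: digit fraction is measured over non-whitespace characters.
    non_space = [ch for ch in danish if not ch.isspace()]
    if sum(ch.isdigit() for ch in non_space) / len(non_space) > 0.30:
        return False
    # URLs in either field are rejected.
    if URL_RE.search(danish) or URL_RE.search(english):
        return False
    # Non-Latin script on the Danish side is rejected.
    if has_non_latin_letters(danish):
        return False
    return True


def filter_pairs(pairs):
    """Keep pairs that pass keep_pair, dropping exact duplicate pairs."""
    seen = set()
    kept = []
    for danish, english in pairs:
        key = (danish, english)  # assumption: dedup on the exact pair
        if key in seen:
            continue
        if keep_pair(danish, english):
            seen.add(key)
            kept.append(key)
    return kept
```

A predicate of this shape could also be passed to `ds.filter(lambda row: keep_pair(row["danish"], row["english"]))` on the loaded dataset. Because the assumptions above may differ from the rules actually used, it will not necessarily keep every published pair.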
Provider:
oliverkinch