
oliverkinch/machine-translation-da-uk

Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/oliverkinch/machine-translation-da-uk
Description:
---
language:
- da
- uk
license: cc0-1.0
task_categories:
- translation
tags:
- danish
- ukrainian
- opus
- parallel-corpus
- translation
pretty_name: Danish-Ukrainian Translation Dataset
size_categories:
- 1M<n<10M
source_datasets:
- opus
---

# Danish-Ukrainian Machine Translation Dataset

A Danish–Ukrainian parallel corpus assembled from multiple OPUS sources and filtered using heuristic quality rules.

## Dataset Summary

| Attribute                      | Value                           |
| ------------------------------ | ------------------------------- |
| Columns                        | `danish`, `ukrainian`, `source` |
| Language pair                  | Danish → Ukrainian              |
| Sentence length range (Danish) | 4 – 150 words                   |

## Load the dataset

```python
from datasets import load_dataset

ds = load_dataset("oliverkinch/machine-translation-da-uk", split="train")
print(ds[0])
# {'danish': '...', 'ukrainian': '...', 'source': 'NLLB'}
```

## Data Fields

| Field       | Type   | Description                  |
| ----------- | ------ | ---------------------------- |
| `danish`    | string | Source sentence in Danish    |
| `ukrainian` | string | Target sentence in Ukrainian |
| `source`    | string | Originating corpus name      |

## Source Corpora

The dataset aggregates sentence pairs from the following OPUS corpora:

| Source                   | Domain           |
| ------------------------ | ---------------- |
| ELRC-5179-acts_Ukrainian | EU legal acts    |
| WikiMatrix               | Wikipedia        |
| ELRC-wikipedia_health    | Health           |
| TED2020                  | TED Talks        |
| wikimedia                | Wikipedia        |
| NLLB                     | Multilingual web |

ELRC-5179-acts_Ukrainian contains EU legislative documents translated by professional translators. WikiMatrix and wikimedia derive from Wikipedia content aligned across languages. TED2020 provides parallel transcripts of TED talks. NLLB (No Language Left Behind) is a multilingual corpus released by Meta AI.

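
Because every row records its originating corpus in the `source` field, it can be useful to check how the pairs are distributed across corpora before training. The following is a minimal sketch: `count_by_source` is a hypothetical helper, and the rows are made-up placeholders rather than actual dataset content.

```python
from collections import Counter


def count_by_source(rows):
    """Count sentence pairs per originating corpus (the `source` field)."""
    return Counter(row["source"] for row in rows)


# Placeholder rows for illustration only; real rows come from load_dataset(...).
rows = [
    {"danish": "...", "ukrainian": "...", "source": "NLLB"},
    {"danish": "...", "ukrainian": "...", "source": "TED2020"},
    {"danish": "...", "ukrainian": "...", "source": "NLLB"},
]

counts = count_by_source(rows)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

On the loaded dataset, passing the rows of `ds` (or just the `source` column) to a `Counter` should give the same kind of breakdown, which helps when deciding whether to down-weight a dominant corpus.
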
## Filtering Pipeline

Pairs were retained only if they passed all of the following heuristic filters:

| Filter                       | Criterion |
| ---------------------------- | --------- |
| Minimum words (Danish)       | ≥ 4       |
| Minimum words (Ukrainian)    | ≥ 3       |
| Maximum words (either side)  | ≤ 150     |
| DA/UK word-count ratio       | 0.4 – 2.5 |
| URLs in either field         | Rejected  |
| Non-Latin script in Danish   | Rejected  |
| Non-Cyrillic Ukrainian text  | Rejected  |
| Exact deduplication (Danish) | Enabled   |

## Intended Uses

* Training and fine-tuning machine translation (MT) models for the Danish–Ukrainian direction
* Data augmentation for low-resource Danish and Ukrainian NLP tasks
* Benchmarking cross-lingual representations

## Licensing

Individual source corpora carry their own licenses (primarily CC-BY-4.0 and CC-BY-SA-4.0). Please consult the [OPUS platform](https://opus.nlpl.eu) for the license of each sub-corpus before use in commercial settings.

## Citation

If you use this dataset, please cite the relevant OPUS sub-corpora. The OPUS platform is described in:

```bibtex
@inproceedings{tiedemann-2012-parallel,
  title     = {Parallel Data, Tools and Interfaces in {OPUS}},
  author    = {Tiedemann, J{\"o}rg},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)},
  year      = {2012},
  pages     = {2214--2218}
}
```
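
The heuristic rules in the Filtering Pipeline table can be approximated in a few lines of Python. This is only a sketch of how such filters might be written, not the pipeline actually used to build the dataset: the function names, the whitespace-based word count, the URL pattern, and the rule that every letter must belong to the expected script are all assumptions.

```python
import re
import unicodedata

# Assumed URL pattern; the dataset card does not say how URLs were detected.
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)


def all_letters_in_script(text, script):
    """True if every alphabetic character's Unicode name starts with `script`."""
    return all(
        unicodedata.name(ch, "").startswith(script)
        for ch in text
        if ch.isalpha()
    )


def passes_filters(danish, ukrainian):
    """Apply the per-pair rules from the table (words = whitespace tokens)."""
    n_da, n_uk = len(danish.split()), len(ukrainian.split())
    if n_da < 4 or n_uk < 3:            # minimum lengths
        return False
    if n_da > 150 or n_uk > 150:        # maximum length, either side
        return False
    if not 0.4 <= n_da / n_uk <= 2.5:   # DA/UK word-count ratio
        return False
    if URL_RE.search(danish) or URL_RE.search(ukrainian):
        return False
    # Assumed strictness: a single letter from another script rejects the pair.
    if not all_letters_in_script(danish, "LATIN"):
        return False
    if not all_letters_in_script(ukrainian, "CYRILLIC"):
        return False
    return True


def filter_pairs(pairs):
    """Keep pairs that pass all rules, with exact deduplication on the Danish side."""
    seen, kept = set(), []
    for danish, ukrainian in pairs:
        if danish in seen or not passes_filters(danish, ukrainian):
            continue
        seen.add(danish)
        kept.append((danish, ukrainian))
    return kept


# Made-up sample pairs for illustration.
sample = [
    ("Jeg har en rød bil.", "У мене є червона машина."),
    ("Jeg har en rød bil.", "Я маю червону машину."),   # duplicate Danish side
    ("Hej verden", "Привіт світ"),                       # too short
    ("Besøg vores side på https://example.com i dag",
     "Відвідайте наш сайт сьогодні"),                    # contains a URL
]
kept = filter_pairs(sample)
print(kept)
```

The script check here is deliberately strict; a real pipeline might instead tolerate a small share of foreign-script characters, for example Latin-script names inside Ukrainian text.
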
Provider:
oliverkinch