
oliverkinch/machine-translation-da-ar

Hugging Face | Updated 2026-04-21 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/oliverkinch/machine-translation-da-ar
Resource summary:
---
language:
- da
- ar
license: cc0-1.0
task_categories:
- translation
tags:
- danish
- arabic
- opus
- parallel-corpus
- translation
pretty_name: Danish-Arabic Translation Dataset
size_categories:
- 1M<n<10M
source_datasets:
- opus
---

# Danish-Arabic Machine Translation Dataset

A Danish–Arabic parallel corpus assembled from multiple OPUS sources and filtered using heuristic quality rules.

## Dataset Summary

| Attribute                       | Value                        |
| ------------------------------- | ---------------------------- |
| Columns                         | `danish`, `arabic`, `source` |
| Language pair                   | Danish → Arabic              |
| Sentence length range (Danish)  | 4 – 150 words                |

## Load the dataset

```python
from datasets import load_dataset

ds = load_dataset("oliverkinch/machine-translation-da-ar", split="train")
print(ds[0])
# {'danish': '...', 'arabic': '...', 'source': 'NLLB'}
```

## Data Fields

| Field     | Type   | Description                |
| --------- | ------ | -------------------------- |
| `danish`  | string | Source sentence in Danish  |
| `arabic`  | string | Target sentence in Arabic  |
| `source`  | string | Originating corpus name    |

## Source Corpora

The dataset aggregates sentence pairs from the following OPUS corpora:

| Source                | Domain    |
| --------------------- | --------- |
| WikiMatrix            | Wikipedia |
| TED2020               | TED Talks |
| ELRC-wikipedia_health | Health    |
| wikimedia             | Wikipedia |
| NLLB                  | Web       |

WikiMatrix and wikimedia align Wikipedia content across languages. TED2020 provides parallel transcripts of TED talks. ELRC-wikipedia_health covers health and COVID-19 related Wikipedia content. NLLB (No Language Left Behind) is a multilingual corpus released by Meta AI.

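Because every row carries a `source` field, the contribution of each corpus can be tallied directly. The following is a minimal sketch on a few hypothetical placeholder rows. The helper name `count_by_source` and the sample records are illustrative assumptions, not part of the dataset or its tooling.

```python
from collections import Counter

# Hypothetical placeholder rows shaped like the dataset's records.
rows = [
    {"danish": "...", "arabic": "...", "source": "NLLB"},
    {"danish": "...", "arabic": "...", "source": "TED2020"},
    {"danish": "...", "arabic": "...", "source": "NLLB"},
    {"danish": "...", "arabic": "...", "source": "WikiMatrix"},
]


def count_by_source(records):
    """Count how many sentence pairs come from each originating corpus."""
    return Counter(record["source"] for record in records)


counts = count_by_source(rows)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

With the real data loaded as in the Load the dataset section, passing `ds` in place of `rows` should behave the same way, since iterating a `Dataset` yields one dict per row.
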
## Filtering Pipeline

Pairs were retained only if they passed all of the following heuristic filters:

| Filter                        | Criterion |
| ----------------------------- | --------- |
| Minimum words (Danish)        | ≥ 4       |
| Minimum words (Arabic)        | ≥ 3       |
| Maximum words (either side)   | ≤ 150     |
| DA/AR word-count ratio        | 0.4 – 2.5 |
| URLs in either field          | Rejected  |
| Non-Latin script in Danish    | Rejected  |
| Non-Arabic script in Arabic   | Rejected  |
| Exact deduplication (Danish)  | Enabled   |

## Intended Uses

* Training and fine-tuning machine translation (MT) models for the Danish–Arabic direction
* Data augmentation for low-resource Danish and Arabic NLP tasks
* Benchmarking cross-lingual representations

## Licensing

Individual source corpora carry their own licenses (primarily CC-BY-4.0 and CC-BY-SA-4.0). Please consult the [OPUS platform](https://opus.nlpl.eu) for the license of each sub-corpus before use in commercial settings.

## Citation

If you use this dataset, please cite the relevant OPUS sub-corpora. The OPUS platform is described in:

```bibtex
@inproceedings{tiedemann-2012-parallel,
  title     = {Parallel Data, Tools and Interfaces in {OPUS}},
  author    = {Tiedemann, J{\"o}rg},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)},
  year      = {2012},
  pages     = {2214--2218}
}
```
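
The criteria in the Filtering Pipeline table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's actual pipeline: the card does not say how words are counted, how URLs are detected, how script membership is tested, or in what order deduplication runs. Here words are whitespace-separated tokens, a URL is anything containing `http://`, `https://` or `www.`, script is checked through Unicode character names, and deduplication keeps the first pair seen for each Danish sentence. The names `passes_filters`, `filter_pairs` and `uses_only_script` are hypothetical.

```python
import re
import unicodedata

# Assumption: a URL is anything containing "http://", "https://" or "www.".
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)


def uses_only_script(text: str, script: str) -> bool:
    """True if every alphabetic character's Unicode name starts with `script`."""
    return all(
        unicodedata.name(ch, "").startswith(script)
        for ch in text
        if ch.isalpha()
    )


def passes_filters(danish: str, arabic: str) -> bool:
    """Apply the per-pair criteria from the Filtering Pipeline table."""
    da_len = len(danish.split())  # assumption: whitespace-separated words
    ar_len = len(arabic.split())
    if da_len < 4 or ar_len < 3:
        return False
    if da_len > 150 or ar_len > 150:
        return False
    if not 0.4 <= da_len / ar_len <= 2.5:
        return False
    if URL_RE.search(danish) or URL_RE.search(arabic):
        return False
    return uses_only_script(danish, "LATIN") and uses_only_script(arabic, "ARABIC")


def filter_pairs(pairs):
    """Keep pairs that pass all filters, dropping exact duplicates of the Danish side."""
    seen = set()
    kept = []
    for danish, arabic in pairs:
        if danish in seen or not passes_filters(danish, arabic):
            continue
        seen.add(danish)
        kept.append((danish, arabic))
    return kept
```

Calling `filter_pairs` on a list of `(danish, arabic)` tuples returns the surviving pairs in their original order. Since the real thresholds are applied to the published data already, a sketch like this is mainly useful for applying comparable rules to additional data.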
Provider:
oliverkinch