five

locailabs/wikimedia_welsh

收藏
Hugging Face2026-02-18 更新2026-05-10 收录
下载链接:
https://hf-mirror.com/datasets/locailabs/wikimedia_welsh
下载链接
链接失效反馈
官方服务:
资源简介:
--- language: - en - cy license: apache-2.0 task_categories: - translation - question-answering size_categories: - 10K<n<100K --- # 🏴󠁧󠁢󠁷󠁬󠁳󠁿🇬🇧 Welsh-English Wikimedia Translation Dataset Part of the Welsh parallel corpora collection. Contains 83,796 Welsh-English translation pairs in chat format. Please find a blog on the data curation process [here](https://locailabs.com/blog/curating-a-welsh-english-translation-dataset-for-language-models). ## Dataset Description This dataset provides Welsh-English translation pairs from Wikimedia. Wikipedia translations from Wikimedia Foundation's article translation system (combined v20210402 and v20230407). The data has been processed through a multi-stage quality pipeline and formatted for instruction-based fine-tuning. ## Format Each entry is in messages format with balanced bidirectional translations (~50% English→Welsh, ~50% Welsh→English). ## Source Data sourced from [OPUS](https://opus.nlpl.eu/) - Wikimedia. Original source: https://dumps.wikimedia.org/other/contenttranslation/ ## Processing Pipeline 1. **Length Filtering**: Removed pairs < 20 characters 2. **Semantic Deduplication**: MinHash LSH with multilingual embeddings (threshold: 0.85) 3. **Quality Filtering**: Removed URLs, emojis, and excessive repetition 4. **Bidirectional Balancing**: Equal representation of both translation directions ## Citation ## Related Datasets - [locailabs/welsh_parallel_corpora](https://huggingface.co/datasets/locailabs/welsh_parallel_corpora) - Combined dataset
提供机构:
locailabs
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作