locailabs/eubookshop_welsh
收藏Hugging Face2026-02-18 更新2026-05-10 收录
下载链接:
https://hf-mirror.com/datasets/locailabs/eubookshop_welsh
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- en
- cy
license: apache-2.0
task_categories:
- translation
- question-answering
size_categories:
- 10K<n<100K
---
# 🏴🇬🇧 Welsh-English EUbookshop Translation Dataset
Part of the Welsh parallel corpora collection. Contains 2,124 Welsh-English translation pairs in chat format.
Please find a blog on the data curation process [here](https://locailabs.com/blog/curating-a-welsh-english-translation-dataset-for-language-models).
## Dataset Description
This dataset provides Welsh-English translation pairs from EUbookshop. Corpus of documents from the EU bookshop. The data has been processed through a multi-stage quality pipeline and formatted for instruction-based fine-tuning.
## Format
Each entry is in OpenAI chat format with balanced bidirectional translations (~50% English→Welsh, ~50% Welsh→English).
## Source
Data sourced from [OPUS](https://opus.nlpl.eu/) - EUbookshop.
Original source: http://bookshop.europa.eu
## Processing Pipeline
1. **Length Filtering**: Removed pairs < 20 characters
2. **Semantic Deduplication**: MinHash LSH with multilingual embeddings (threshold: 0.85)
3. **Quality Filtering**: Removed URLs, emojis, and excessive repetition
4. **Bidirectional Balancing**: Equal representation of both translation directions
## Citation
## Related Datasets
- [locailabs/welsh_parallel_corpora](https://huggingface.co/datasets/locailabs/welsh_parallel_corpora) - Combined dataset
提供机构:
locailabs



