locailabs/opensubtitles_welsh
Source: Hugging Face. Updated 2026-02-18; indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/locailabs/opensubtitles_welsh
Resource description:
---
language:
- en
- cy
license: apache-2.0
task_categories:
- translation
- question-answering
- text-generation
size_categories:
- 100K<n<1M
---
# 🏴🇬🇧 Welsh-English OpenSubtitles Translation Dataset
A curated bidirectional translation dataset containing 235K+ Welsh-English parallel sentences in chat format, designed for fine-tuning language models on low-resource language translation.
The data curation process is described in [this blog post](https://locailabs.com/blog/curating-a-welsh-english-translation-dataset-for-language-models).
## Dataset Description
This dataset provides Welsh-English translation pairs extracted from movie and TV subtitles. Welsh (Cymraeg) is a low-resource language with limited parallel corpora available for training neural translation models. The data has been processed through a multi-stage quality pipeline and formatted for instruction-based fine-tuning.
### Format
Each entry is in OpenAI chat format:
```json
{
  "messages": [
    {
      "role": "user",
      "content": "Translate the following English text into Welsh:\n\n[source text]"
    },
    {
      "role": "assistant",
      "content": "[translated text]"
    }
  ]
}
```
The dataset is balanced: ~50% English→Welsh and ~50% Welsh→English translations.
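As a rough illustration of how an entry in this format can be read back into a translation pair, the sketch below splits the user prompt into its instruction and source text. The English→Welsh instruction is quoted from the example above; the Welsh→English wording is an assumption made by mirroring it, and `parse_entry` is a hypothetical helper, not part of the dataset tooling.

```python
import re

# The English->Welsh instruction is quoted from the card. The Welsh->English
# wording is an ASSUMPTION (a mirror image of the documented prompt).
PROMPT_RE = re.compile(
    r"^Translate the following (English|Welsh) text into (English|Welsh):\n\n(.*)$",
    re.DOTALL,
)


def parse_entry(entry):
    """Return (source_lang, target_lang, source_text, target_text), or None."""
    messages = entry.get("messages", [])
    if len(messages) != 2:
        return None
    user, assistant = messages
    if user.get("role") != "user" or assistant.get("role") != "assistant":
        return None
    match = PROMPT_RE.match(user.get("content", ""))
    # Reject prompts that do not match or that name the same language twice.
    if match is None or match.group(1) == match.group(2):
        return None
    source_lang, target_lang, source_text = match.groups()
    return source_lang, target_lang, source_text, assistant.get("content", "")


# Toy entry written for this example, not taken from the dataset.
example = {
    "messages": [
        {
            "role": "user",
            "content": "Translate the following English text into Welsh:\n\nGood morning, everyone.",
        },
        {"role": "assistant", "content": "Bore da, bawb."},
    ]
}

parsed = parse_entry(example)
print(parsed)
```

Counting the `(source_lang, target_lang)` values returned for every entry is one way to check the stated ~50/50 direction balance.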
## Data Collection and Processing
### Source Data
Parallel sentences were extracted from the [OpenSubtitles](http://www.opensubtitles.org/) corpus via [OPUS](https://opus.nlpl.eu/OpenSubtitles-v2024.php) (v2024 release).
### Curation Pipeline
**Stage 1: Core Processing**
1. **Length Filtering**: Removed sentence pairs where the text contains fewer than 20 characters
2. **Semantic Deduplication**: Applied MinHash LSH-based deduplication using multilingual sentence embeddings (`paraphrase-multilingual-MiniLM-L12-v2`) with a similarity threshold of 0.85
3. **Bidirectional Balancing**: Randomly partitioned pairs to achieve equal representation of both translation directions
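The three Stage 1 steps can be sketched roughly as follows. This is a simplified illustration, not the pipeline used to build the dataset: it assumes the 20-character minimum applies to both sides of a pair, it replaces MinHash LSH with a brute-force cosine comparison over precomputed vectors (toy vectors here, standing in for sentence embeddings), and the Welsh→English prompt wording is assumed. All function names are hypothetical.

```python
import math
import random

MIN_CHARS = 20        # from the card: minimum text length in characters
SIM_THRESHOLD = 0.85  # from the card: deduplication similarity threshold

PROMPTS = {
    "en-cy": "Translate the following English text into Welsh:\n\n",
    "cy-en": "Translate the following Welsh text into English:\n\n",  # assumed wording
}


def long_enough(en, cy, min_chars=MIN_CHARS):
    # ASSUMPTION: the length check is applied to both sides of the pair.
    return len(en.strip()) >= min_chars and len(cy.strip()) >= min_chars


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dedupe(pairs, vectors, threshold=SIM_THRESHOLD):
    # Greedy O(n^2) stand-in for LSH: keep a pair only if its vector is
    # below the threshold against every pair already kept.
    kept, kept_vectors = [], []
    for pair, vector in zip(pairs, vectors):
        if all(cosine(vector, other) < threshold for other in kept_vectors):
            kept.append(pair)
            kept_vectors.append(vector)
    return kept


def assign_directions(pairs, seed=0):
    # Shuffle, then send the first half to en->cy and the rest to cy->en.
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return ([("en-cy", en, cy) for en, cy in shuffled[:half]]
            + [("cy-en", cy, en) for en, cy in shuffled[half:]])


def to_chat(direction, source, target):
    return {"messages": [
        {"role": "user", "content": PROMPTS[direction] + source},
        {"role": "assistant", "content": target},
    ]}


# Toy (English, Welsh) pairs written for this example.
pairs = [
    ("Good morning, how are you today?", "Bore da, sut wyt ti heddiw?"),
    ("Good morning, how are you today!", "Bore da, sut wyt ti heddiw!"),
    ("Thanks.", "Diolch."),
    ("I will see you at the station tomorrow.", "Fe wela i di yn yr orsaf yfory."),
]

filtered = [p for p in pairs if long_enough(*p)]
toy_vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]  # one vector per filtered pair
unique = dedupe(filtered, toy_vectors)
examples = assign_directions(unique)
entries = [to_chat(*example) for example in examples]
print(len(filtered), len(unique), [direction for direction, _, _ in examples])
```

In a real run the vectors would come from an embedding model such as the one named above, and an approximate index would replace the quadratic comparison, which does not scale to hundreds of thousands of pairs.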
**Stage 2: Quality Filtering**
- Removed pairs containing URLs
- Removed pairs containing emojis introduced by subtitle formatting artifacts
- Removed pairs with excessive character or word repetition
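A minimal sketch of filters in the spirit of Stage 2 is shown below. The card does not give the exact patterns or thresholds, so everything here is an assumption: the URL pattern, the Unicode ranges treated as emoji or pictographic symbols, and the repetition limits (five identical characters in a row, or the same word four times in a row).

```python
import re

# All patterns and thresholds below are ASSUMPTIONS for illustration.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
# Common emoji blocks plus miscellaneous symbols and dingbats (e.g. music notes).
SYMBOL_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
CHAR_RUN_RE = re.compile(r"(.)\1{4,}")  # same character 5+ times in a row
WORD_RUN_RE = re.compile(r"\b(\w+)(?:\s+\1\b){3,}", re.IGNORECASE)  # same word 4+ times

FILTERS = (URL_RE, SYMBOL_RE, CHAR_RUN_RE, WORD_RUN_RE)


def is_clean(text):
    """True if none of the filter patterns occur in the text."""
    return not any(pattern.search(text) for pattern in FILTERS)


def keep_pair(en, cy):
    # A pair is dropped if either side trips a filter.
    return is_clean(en) and is_clean(cy)


# Toy sentences written for this example.
print(keep_pair("Where did you leave the car keys?",
                "Ble wnest ti adael allweddi'r car?"))
print(is_clean("Visit www.example.com for more subtitles"))
print(is_clean("no no no no, stop it"))
```

Checking both sides of a pair keeps the English and Welsh text aligned: a pair is either kept whole or removed whole.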
## Limitations
- Source data consists of subtitle text, which contains informal dialogue, colloquialisms, and incomplete sentences typical of spoken language
## Citation
Please cite the original OpenSubtitles corpus:
```bibtex
@inproceedings{lison-tiedemann-2016-opensubtitles2016,
title = "{O}pen{S}ubtitles2016: Extracting Large Parallel Corpora from Movie and {TV} Subtitles",
author = "Lison, Pierre and Tiedemann, J{\"o}rg",
booktitle = "Proceedings of LREC 2016",
year = "2016",
}
```
## Acknowledgments
- [OpenSubtitles](http://www.opensubtitles.org/) community
- [OPUS project](https://opus.nlpl.eu/) for data access
**Note:** Please link to http://www.opensubtitles.org/ in any publications using this data.
Provider:
locailabs