
locailabs/opensubtitles_welsh

Hugging Face · Updated 2026-02-18 · Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/locailabs/opensubtitles_welsh
Resource description:
---
language:
- en
- cy
license: apache-2.0
task_categories:
- translation
- question-answering
- text-generation
size_categories:
- 100K<n<1M
---

# 🏴󠁧󠁢󠁷󠁬󠁳󠁿🇬🇧 Welsh-English OpenSubtitles Translation Dataset

A curated bidirectional translation dataset containing 235K+ Welsh-English parallel sentences in chat format, designed for fine-tuning language models on low-resource language translation.

A blog post describing the data curation process is available [here](https://locailabs.com/blog/curating-a-welsh-english-translation-dataset-for-language-models).

## Dataset Description

This dataset provides Welsh-English translation pairs extracted from movie and TV subtitles. Welsh (Cymraeg) is a low-resource language with limited parallel corpora available for training neural translation models. The data has been processed through a multi-stage quality pipeline and formatted for instruction-based fine-tuning.

### Format

Each entry is in OpenAI chat format:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Translate the following English text into Welsh:\n\n[source text]"
    },
    {
      "role": "assistant",
      "content": "[translated text]"
    }
  ]
}
```

The dataset is balanced: ~50% English→Welsh and ~50% Welsh→English translations.

## Data Collection and Processing

### Source Data

Parallel sentences extracted from the [OpenSubtitles](http://www.opensubtitles.org/) corpus via [OPUS](https://opus.nlpl.eu/OpenSubtitles-v2024.php) (v2024 release).

### Curation Pipeline

**Stage 1: Core Processing**

1. **Length Filtering**: Removed sentence pairs where the text contains fewer than 20 characters
2. **Semantic Deduplication**: Applied MinHash LSH-based deduplication using multilingual sentence embeddings (`paraphrase-multilingual-MiniLM-L12-v2`) with a similarity threshold of 0.85
3. **Bidirectional Balancing**: Randomly partitioned pairs to achieve equal representation of both translation directions

**Stage 2: Quality Filtering**

- Removed pairs containing URLs
- Removed pairs containing emojis from subtitle formatting artifacts
- Removed pairs with excessive character or word repetition

## Limitations

- Source data consists of subtitle text, which contains informal dialogue, colloquialisms, and incomplete sentences typical of spoken language

## Citation

Please cite the original OpenSubtitles corpus:

```bibtex
@inproceedings{lison-tiedemann-2016-opensubtitles2016,
  title = "{O}pen{S}ubtitles2016: Extracting Large Parallel Corpora from Movie and {TV} Subtitles",
  author = "Lison, Pierre and Tiedemann, J{\"o}rg",
  booktitle = "Proceedings of LREC 2016",
  year = "2016",
}
```

## Acknowledgments

- [OpenSubtitles](http://www.opensubtitles.org/) community
- [OPUS project](https://opus.nlpl.eu/) for data access

**Note:** Please link to http://www.opensubtitles.org/ in any publications using this data.
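
## Illustrative Sketch of the Filtering and Formatting Steps

The length filter, the URL and repetition filters, the bidirectional split, and the chat format described in this card can be sketched in a few lines of Python. This is a rough illustration and not the pipeline actually used to build the dataset. It rests on several assumptions: the length filter is applied to each side of a pair, the repetition thresholds are invented for the example, the Welsh→English prompt wording mirrors the English→Welsh prompt shown under Format, and the semantic deduplication and emoji steps are left out. All function names and the sample sentences are hypothetical.

```python
import random
import re

MIN_CHARS = 20  # the card states that text under 20 characters was removed

URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
# Assumed thresholds (not from the card): 5+ identical characters in a row,
# or the same word 4+ times in a row, count as "excessive repetition".
CHAR_RUN_RE = re.compile(r"(.)\1{4,}")
WORD_RUN_RE = re.compile(r"\b(\w+)(?:\s+\1\b){3,}", re.IGNORECASE)


def keep_pair(en: str, cy: str) -> bool:
    """Return True if both sides pass the length, URL, and repetition checks."""
    for text in (en, cy):
        if len(text.strip()) < MIN_CHARS:
            return False
        if URL_RE.search(text):
            return False
        if CHAR_RUN_RE.search(text) or WORD_RUN_RE.search(text):
            return False
    return True


def to_chat(source: str, target: str, src_lang: str, tgt_lang: str) -> dict:
    """Wrap one translation pair in the chat format shown in the card."""
    prompt = f"Translate the following {src_lang} text into {tgt_lang}:\n\n{source}"
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": target},
        ]
    }


def build_records(pairs, seed=0):
    """Filter (en, cy) pairs, then split them roughly 50/50 by direction."""
    rng = random.Random(seed)
    kept = [pair for pair in pairs if keep_pair(*pair)]
    rng.shuffle(kept)  # random partition: first half en->cy, rest cy->en
    half = len(kept) // 2
    records = []
    for i, (en, cy) in enumerate(kept):
        if i < half:
            records.append(to_chat(en, cy, "English", "Welsh"))
        else:
            records.append(to_chat(cy, en, "Welsh", "English"))
    return records


if __name__ == "__main__":
    # Made-up sample pairs for illustration; they are not taken from the dataset.
    sample = [
        ("Good morning, how are you today?", "Bore da, sut wyt ti heddiw?"),
        ("Thank you very much for your help.", "Diolch yn fawr iawn am eich help."),
        ("Yes.", "Ie."),  # dropped: too short
    ]
    for record in build_records(sample):
        print(record["messages"][0]["content"].splitlines()[0])
```

Shuffling before splitting keeps the direction assignment independent of the input order, and the fixed seed makes the split reproducible.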
Provider:
locailabs