
locailabs/cofnodycynulliad_en_cy

Hugging Face · Updated 2026-04-08 · Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/locailabs/cofnodycynulliad_en_cy
Resource description:
---
language:
- en
- cy
license: other
task_categories:
- translation
- text-generation
size_categories:
- 10K<n<100K
tags:
- welsh
- cymraeg
- parallel-corpus
- instruction-tuning
- sft
---

# Senedd Plenary Transcripts — Welsh–English SFT Dataset

Processed instruction-tuning dataset derived from `techiaith/cofnodycynulliad_en-cy`, a Welsh–English parallel translation memory published by the Bangor University Language Technologies Unit (Techiaith). Formatted for supervised fine-tuning (SFT) of language models.

## Source

| Field | Value |
|-------|-------|
| Source dataset | [`techiaith/cofnodycynulliad_en-cy`](https://huggingface.co/datasets/techiaith/cofnodycynulliad_en-cy) |
| Domain | Senedd (Welsh Parliament) plenary transcripts |
| Raw pairs | 104,738 |
| Processed examples | 19,581 |
| Licence | Open Government Licence v3.0 (OGL) |

All translations were produced by professional translators working within Welsh public sector institutions.

## Format

The dataset uses the messages chat format. Two example types are included.

**Single-turn** (~70% of examples):

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Translate the following English text into Welsh:\n\nThe application must be submitted before the deadline."
    },
    {
      "role": "assistant",
      "content": "Rhaid cyflwyno'r cais cyn y dyddiad cau."
    }
  ],
  "source_dataset": "techiaith/cofnodycynulliad_en-cy"
}
```

**Multi-turn** (~30% of examples):

```json
{
  "messages": [
    {
      "role": "user",
      "content": "I'd like you to translate a series of English sentences into Welsh. I'll give you one sentence at a time.\n\nThe committee has considered this matter in detail."
    },
    {
      "role": "assistant",
      "content": "Mae'r pwyllgor wedi ystyried y mater hwn yn fanwl."
    },
    {
      "role": "user",
      "content": "An amendment was proposed by the Member for Ynys Môn."
    },
    {
      "role": "assistant",
      "content": "Cynigiwyd gwelliant gan yr Aelod dros Fôn."
    }
  ],
  "source_dataset": "techiaith/cofnodycynulliad_en-cy"
}
```

**Fields:**

- `messages`: Translation task in chat format
- `source_dataset`: HuggingFace ID of the originating source corpus

The dataset is balanced: ~50% English→Welsh and ~50% Welsh→English translations. Instruction prompts are drawn from a diverse pool of 21 template phrasings in both English and Welsh to reduce overfitting to a single prompt pattern.

## Curation Pipeline

Raw pairs from the source dataset were processed through the following stages:

1. **Length filter**: pairs where either side is fewer than 20 characters are removed
2. **Artefact filter**: pairs containing URLs, emoji, bullet/list markers, or excessive repetition are removed
3. **Exact deduplication**: normalised string deduplication (NFC, lowercased, whitespace-collapsed)
4. **MinHash deduplication**: 1-gram MinHash LSH (128 permutations, Jaccard threshold 0.9) to remove near-identical surface-form variants
5. **Semantic deduplication**: embedding-based deduplication via SemHash (`minishlab/potion-multilingual-8M`, cosine threshold 0.85) to remove semantically equivalent pairs with different surface forms
6. **Instruction formatting**: conversion to chat format with template diversity and multi-turn conversation grouping

## Limitations

All source data is drawn from formal institutional domains. The dataset covers formal Welsh well but underrepresents colloquial, spoken, and informal registers. Source translation memories may contain a small number of misaligned pairs that are not detectable without a dedicated quality scorer.
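
As a rough illustration of the curation pipeline, stages 1 (length filter), 3 (exact deduplication) and 6 (instruction formatting) might look something like the sketch below. This is not the Locai Labs implementation: the function names are hypothetical, keying deduplication on both sides of the pair is an assumption, and a single fixed English→Welsh prompt stands in for the pool of 21 templates. It uses only the Python standard library.

```python
import unicodedata

MIN_CHARS = 20  # threshold stated in the length filter stage
SOURCE_ID = "techiaith/cofnodycynulliad_en-cy"


def normalise(text: str) -> str:
    """NFC-normalise, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.lower().split())


def passes_length_filter(en: str, cy: str) -> bool:
    """Keep a pair only if both sides have at least MIN_CHARS characters."""
    return len(en) >= MIN_CHARS and len(cy) >= MIN_CHARS


def exact_dedup(pairs):
    """Drop pairs whose normalised key was already seen.

    Assumption: the key covers both sides of the pair; the dataset card
    does not say which side(s) the real pipeline compares.
    """
    seen = set()
    kept = []
    for en, cy in pairs:
        key = (normalise(en), normalise(cy))
        if key not in seen:
            seen.add(key)
            kept.append((en, cy))
    return kept


def to_single_turn(en: str, cy: str) -> dict:
    """Format one English->Welsh pair as a single-turn chat example."""
    prompt = f"Translate the following English text into Welsh:\n\n{en}"
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": cy},
        ],
        "source_dataset": SOURCE_ID,
    }


raw_pairs = [
    ("The application must be submitted before the deadline.",
     "Rhaid cyflwyno'r cais cyn y dyddiad cau."),
    # Same pair with different casing and spacing: removed by exact dedup.
    ("The  application must be submitted before the DEADLINE.",
     "Rhaid cyflwyno'r cais cyn y dyddiad cau."),
    # Both sides are under 20 characters: removed by the length filter.
    ("Thank you.", "Diolch."),
]

filtered = [(en, cy) for en, cy in raw_pairs if passes_length_filter(en, cy)]
examples = [to_single_turn(en, cy) for en, cy in exact_dedup(filtered)]
```

The MinHash and semantic stages are left out here because they depend on third-party libraries whose exact configuration the card only summarises.
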
## Citation

If you use this dataset, please cite the original Techiaith source:

```bibtex
@misc{techiaith_cofnodycynulliad,
  author = {{Bangor University Language Technologies Unit (Techiaith)}},
  title = {Cofnod y Cynulliad (Senedd Plenary Record) Welsh–English Translation Memory},
  year = {2023},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/datasets/techiaith/cofnodycynulliad_en-cy}},
}
```

## Acknowledgements

- [Techiaith (Bangor University)](https://techiaith.cymru/) for producing and releasing the source translation memories
- [Locai Labs](https://huggingface.co/locailabs) for the curation pipeline
Provided by:
locailabs