locailabs/cofnodycynulliad_en_cy
---
language:
- en
- cy
license: other
task_categories:
- translation
- text-generation
size_categories:
- 10K<n<100K
tags:
- welsh
- cymraeg
- parallel-corpus
- instruction-tuning
- sft
---
# Senedd Plenary Transcripts — Welsh–English SFT Dataset
A processed instruction-tuning dataset derived from `techiaith/cofnodycynulliad_en-cy`, a Welsh–English parallel translation memory published by the Bangor University Language Technologies Unit (Techiaith). The data is formatted for supervised fine-tuning (SFT) of language models.
## Source
| Field | Value |
|-------|-------|
| Source dataset | [`techiaith/cofnodycynulliad_en-cy`](https://huggingface.co/datasets/techiaith/cofnodycynulliad_en-cy) |
| Domain | Senedd (Welsh Parliament) plenary transcripts |
| Raw pairs | 104,738 |
| Processed examples | 19,581 |
| Licence | Open Government Licence v3.0 (OGL) |

All translations were produced by professional translators working within Welsh public sector institutions.
## Format
The dataset uses the messages chat format. Two example types are included.
**Single-turn** (~70% of examples):
```json
{
"messages": [
{
"role": "user",
"content": "Translate the following English text into Welsh:\n\nThe application must be submitted before the deadline."
},
{
"role": "assistant",
"content": "Rhaid cyflwyno'r cais cyn y dyddiad cau."
}
],
"source_dataset": "techiaith/cofnodycynulliad_en-cy"
}
```
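In the example above, the instruction and the text to translate are separated by a blank line. A minimal sketch of splitting a user turn on that separator follows; it assumes every prompt template uses the same `\n\n` layout, which the card shows only for the examples here, and `split_prompt` is a hypothetical helper name.

```python
from typing import Tuple


def split_prompt(content: str) -> Tuple[str, str]:
    """Split a user turn into (instruction, source_text) at the first blank line.

    Assumption: the instruction and the source text are separated by "\n\n",
    as in the examples shown in this card.
    """
    instruction, separator, source_text = content.partition("\n\n")
    if not separator:
        raise ValueError("no blank line between instruction and source text")
    return instruction, source_text


user_content = (
    "Translate the following English text into Welsh:\n\n"
    "The application must be submitted before the deadline."
)
instruction, source_text = split_prompt(user_content)
```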
**Multi-turn** (~30% of examples):
```json
{
"messages": [
{
"role": "user",
"content": "I'd like you to translate a series of English sentences into Welsh. I'll give you one sentence at a time.\n\nThe committee has considered this matter in detail."
},
{
"role": "assistant",
"content": "Mae'r pwyllgor wedi ystyried y mater hwn yn fanwl."
},
{
"role": "user",
"content": "An amendment was proposed by the Member for Ynys Môn."
},
{
"role": "assistant",
"content": "Cynigiwyd gwelliant gan yr Aelod dros Fôn."
}
],
"source_dataset": "techiaith/cofnodycynulliad_en-cy"
}
```
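Both example types share the same structure: user and assistant turns alternate, and a multi-turn example simply has more than one pair. The sketch below checks that alternation and extracts the pairs. `turn_pairs` is a hypothetical helper, not part of the dataset tooling.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def turn_pairs(messages: List[Message]) -> List[Tuple[str, str]]:
    """Return (user_content, assistant_content) pairs from a messages list.

    Raises ValueError if the roles do not strictly alternate user/assistant.
    """
    if not messages or len(messages) % 2 != 0:
        raise ValueError("expected a non-empty, even number of messages")
    pairs = []
    for i in range(0, len(messages), 2):
        user, assistant = messages[i], messages[i + 1]
        if user["role"] != "user" or assistant["role"] != "assistant":
            raise ValueError(f"unexpected roles at positions {i} and {i + 1}")
        pairs.append((user["content"], assistant["content"]))
    return pairs


example = {
    "messages": [
        {"role": "user", "content": "I'd like you to translate a series of English sentences into Welsh. I'll give you one sentence at a time.\n\nThe committee has considered this matter in detail."},
        {"role": "assistant", "content": "Mae'r pwyllgor wedi ystyried y mater hwn yn fanwl."},
        {"role": "user", "content": "An amendment was proposed by the Member for Ynys Môn."},
        {"role": "assistant", "content": "Cynigiwyd gwelliant gan yr Aelod dros Fôn."},
    ],
    "source_dataset": "techiaith/cofnodycynulliad_en-cy",
}

pairs = turn_pairs(example["messages"])
is_multi_turn = len(pairs) > 1  # True for the example above
```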
**Fields:**
- `messages`: the translation task as a list of chat turns, each with a `role` (`user` or `assistant`) and a `content` string
- `source_dataset`: Hugging Face ID of the originating source corpus

The dataset is balanced: ~50% English→Welsh and ~50% Welsh→English translations. Instruction prompts are drawn from a diverse pool of 21 template phrasings in both English and Welsh to reduce overfitting to a single prompt pattern.
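As a rough illustration of how one aligned pair could become a single-turn example with a randomly chosen direction and template, consider the sketch below. It is not the actual formatting code: the full pool of 21 templates is not reproduced here, only the English→Welsh template comes from the example in this card, and the Welsh→English template and the `build_example` name are hypothetical.

```python
import random
from typing import Dict, List

SOURCE_ID = "techiaith/cofnodycynulliad_en-cy"

# Stand-in template pool. The "en-cy" entry is taken from the single-turn
# example in this card; the "cy-en" entry is a hypothetical mirror of it.
TEMPLATES: Dict[str, List[str]] = {
    "en-cy": ["Translate the following English text into Welsh:"],
    "cy-en": ["Translate the following Welsh text into English:"],
}


def build_example(en: str, cy: str, rng: random.Random) -> Dict:
    """Format one aligned pair as a single-turn chat example."""
    # Picking the direction uniformly gives roughly a 50/50 split overall.
    direction = rng.choice(["en-cy", "cy-en"])
    source, target = (en, cy) if direction == "en-cy" else (cy, en)
    prompt = rng.choice(TEMPLATES[direction])
    return {
        "messages": [
            {"role": "user", "content": f"{prompt}\n\n{source}"},
            {"role": "assistant", "content": target},
        ],
        "source_dataset": SOURCE_ID,
    }


en_sentence = "The application must be submitted before the deadline."
cy_sentence = "Rhaid cyflwyno'r cais cyn y dyddiad cau."
built = build_example(en_sentence, cy_sentence, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the direction and template choice reproducible for a given seed.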
## Curation Pipeline
Raw pairs from the source dataset were processed through the following stages:
1. **Length filter** — pairs where either side is fewer than 20 characters are removed
2. **Artefact filter** — pairs containing URLs, emoji, bullet/list markers, or excessive repetition are removed
3. **Exact deduplication** — normalised string deduplication (NFC, lowercased, whitespace-collapsed)
4. **MinHash deduplication** — 1-gram MinHash LSH (128 permutations, Jaccard threshold 0.9) to remove near-identical surface-form variants
5. **Semantic deduplication** — embedding-based deduplication via SemHash (`minishlab/potion-multilingual-8M`, cosine threshold 0.85) to remove semantically equivalent pairs with different surface forms
6. **Instruction formatting** — conversion to chat format with template diversity and multi-turn conversation grouping
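
The first three stages can be sketched with the standard library alone. This is an illustration under stated assumptions, not the pipeline's actual code: the artefact patterns below cover only URLs and leading list markers, lengths are measured after stripping whitespace, the deduplication key is the normalised pair rather than a single side, and all names are hypothetical. The MinHash and SemHash stages depend on external libraries and are not shown.

```python
import re
import unicodedata
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (English, Welsh)

MIN_CHARS = 20  # stage 1 threshold stated in the card

# Assumed artefact patterns; the real filter also covers emoji and repetition.
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)
LIST_MARKER_RE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def long_enough(pair: Pair) -> bool:
    """Stage 1: keep the pair only if both sides have at least MIN_CHARS characters."""
    return all(len(side.strip()) >= MIN_CHARS for side in pair)


def has_artefact(pair: Pair) -> bool:
    """Stage 2 (partial): flag pairs containing a URL or a leading list marker."""
    return any(bool(URL_RE.search(s) or LIST_MARKER_RE.search(s)) for s in pair)


def normalise(text: str) -> str:
    """Stage 3 key: NFC-normalise, lowercase, and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).lower().split())


def curate(pairs: Iterable[Pair]) -> List[Pair]:
    """Apply stages 1-3 in order, keeping the first occurrence of each pair."""
    seen = set()
    kept = []
    for pair in pairs:
        if not long_enough(pair) or has_artefact(pair):
            continue
        key = (normalise(pair[0]), normalise(pair[1]))
        if key in seen:
            continue
        seen.add(key)
        kept.append(pair)
    return kept


raw = [
    ("The committee has considered this matter in detail.",
     "Mae'r pwyllgor wedi ystyried y mater hwn yn fanwl."),
    # Same pair after lowercasing and whitespace collapsing
    ("The  committee has considered this matter in DETAIL.",
     "Mae'r pwyllgor wedi ystyried y mater hwn yn fanwl."),
    # Both sides shorter than 20 characters
    ("Thank you.", "Diolch."),
    # English side contains a URL
    ("The application must be submitted before the deadline: https://example.com",
     "Rhaid cyflwyno'r cais cyn y dyddiad cau."),
]
curated = curate(raw)
```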
## Limitations
All source data is drawn from formal institutional domains. The dataset covers formal Welsh well but underrepresents colloquial, spoken, and informal registers. Source translation memories may contain a small number of misaligned pairs that are not detectable without a dedicated quality scorer.
## Citation
If you use this dataset, please cite the original Techiaith source:
```bibtex
@misc{techiaith_cofnodycynulliad,
author = {{Bangor University Language Technologies Unit (Techiaith)}},
title = {Cofnod y Cynulliad (Senedd Plenary Record) Welsh–English Translation Memory},
year = {2023},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/datasets/techiaith/cofnodycynulliad_en-cy}},
}
```
## Acknowledgements
- [Techiaith (Bangor University)](https://techiaith.cymru/) for producing and releasing the source translation memories
- [Locai Labs](https://huggingface.co/locailabs) for the curation pipeline