JMasr/balidea-medquad-qa-gl
Source: Hugging Face. Updated 2026-03-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/JMasr/balidea-medquad-qa-gl
Resource description:
---
language:
- en
- gl
license: apache-2.0
task_categories:
- text-classification
- question-answering
tags:
- medical
- translation
- galician
- medquad
- synthetic
size_categories:
- 10K<n<100K
---
# balidea-medquad-qa-gl
Galician translation of the [MedQuAD](https://github.com/abachaa/MedQuAD) medical question-answering dataset. Contains 16,407 question-answer pairs translated from English to Galician using an automatic translation pipeline.
## Dataset splits
| Split | Rows |
|-------|-----:|
| train | 13,125 |
| validation | 1,641 |
| test | 1,641 |
| **Total** | **16,407** |
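
The row counts above work out to roughly an 80/10/10 partition. A quick check of that arithmetic, using only the numbers from the table (the variable names are illustrative):

```python
# Row counts copied from the splits table.
SPLIT_ROWS = {"train": 13_125, "validation": 1_641, "test": 1_641}

total = sum(SPLIT_ROWS.values())

# Share of the full dataset held by each split, rounded to 3 decimals.
fractions = {name: round(rows / total, 3) for name, rows in SPLIT_ROWS.items()}

print(total)
print(fractions)
```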
## Features
- `text` — Medical question or answer in Galician (`string`)
- `labels` — Classification label (`int64`)
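
A minimal loading sketch, assuming the dataset is published on the Hugging Face Hub under the repository id shown in the title and loads with the default configuration of the `datasets` library. The helper names `load_balidea` and `missing_columns` are hypothetical, introduced here only for illustration.

```python
# Columns listed in the Features section of this card.
EXPECTED_COLUMNS = {"text", "labels"}


def load_balidea(repo_id: str = "JMasr/balidea-medquad-qa-gl"):
    """Download all splits from the Hub (requires network access).

    Assumption: the repository loads with the default configuration.
    """
    # Imported lazily so the helper below works without the dependency.
    from datasets import load_dataset

    return load_dataset(repo_id)


def missing_columns(column_names):
    """Return the expected columns that are absent from a split, sorted."""
    return sorted(EXPECTED_COLUMNS - set(column_names))


# Example usage (commented out because it needs network access):
# ds = load_balidea()
# for split_name, split in ds.items():
#     print(split_name, split.num_rows, missing_columns(split.column_names))
```

Checking the column names after loading is a cheap way to confirm that the local copy matches the schema described above before training on it.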
## Translation pipeline
Translations were produced by a dual-engine pipeline designed to balance speed, fluency, and domain accuracy:
1. **Dual-engine translation** — Each sentence is translated in parallel by two models:
   - [Helsinki-NLP OPUS-MT](https://huggingface.co/Helsinki-NLP): a lightweight, fast neural MT model
   - TranslateGemma 12B: a large language model with stronger contextual understanding
2. **Arbitration** — A second TranslateGemma 12B instance acts as an arbitrator, comparing both outputs and synthesizing a final translation that prioritises natural fluency and adherence to [ILG-RAG](https://ilg.usc.gal/) orthographic standards for Galician.
3. **Quality scoring** — Every translation is scored with [COMETKiwi](https://huggingface.co/Unbabel/wmt22-cometkiwi-da) (reference-free quality estimation). Sentences below the quality threshold are flagged for manual review.
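
The three steps above can be sketched as a small orchestration function. This is not the authors' implementation: the engines, arbitrator, and scorer are passed in as plain callables, the stubs at the bottom stand in for the real models, and `QUALITY_THRESHOLD` is a made-up value because the card does not state the actual cutoff.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

# Hypothetical cutoff: the card does not state the real threshold.
QUALITY_THRESHOLD = 0.80


@dataclass
class TranslationResult:
    source: str
    translation: str
    score: float
    needs_review: bool


def translate_sentence(
    source: str,
    fast_engine: Callable[[str], str],
    llm_engine: Callable[[str], str],
    arbitrate: Callable[[str, str, str], str],
    score: Callable[[str, str], float],
    threshold: float = QUALITY_THRESHOLD,
) -> TranslationResult:
    # Step 1: run both engines concurrently on the same sentence.
    with ThreadPoolExecutor(max_workers=2) as pool:
        fast_future = pool.submit(fast_engine, source)
        llm_future = pool.submit(llm_engine, source)
        fast_out, llm_out = fast_future.result(), llm_future.result()

    # Step 2: the arbitrator sees the source and both candidates.
    final = arbitrate(source, fast_out, llm_out)

    # Step 3: reference-free quality estimate; low scores are flagged.
    quality = score(source, final)
    return TranslationResult(source, final, quality, quality < threshold)


# Stub callables with placeholder strings, so the sketch runs offline.
result = translate_sentence(
    "What is asthma?",
    fast_engine=lambda s: "Que é asma?",
    llm_engine=lambda s: "Que é a asma?",
    arbitrate=lambda src, a, b: b,
    score=lambda src, mt: 0.75,
)
print(result.needs_review)
```

Passing the models in as callables keeps the control flow separate from model loading, so the same function could wrap an OPUS-MT pipeline, an LLM client, and a COMETKiwi scorer without changes.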
## Source dataset
Based on [MedQuAD](https://github.com/abachaa/MedQuAD) (Medical Question Answering Dataset) by Ben Abacha and Demner-Fushman (2019), which covers 37 question types (for example treatment, diagnosis, and side effects) with answers sourced from NIH websites.
Provider:
JMasr



