
JMasr/balidea-medquad-qa-gl

Source: Hugging Face. Updated 2026-03-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/JMasr/balidea-medquad-qa-gl
Resource description:
---
language:
- en
- gl
license: apache-2.0
task_categories:
- text-classification
- question-answering
tags:
- medical
- translation
- galician
- medquad
- synthetic
size_categories:
- 10K<n<100K
---

# balidea-medquad-qa-gl

Galician translation of the [MedQuAD](https://github.com/abachaa/MedQuAD) medical question-answering dataset. Contains 16,407 question-answer pairs translated from English to Galician using an automatic translation pipeline.

## Dataset splits

| Split | Rows |
|-------|-----:|
| train | 13,125 |
| validation | 1,641 |
| test | 1,641 |
| **Total** | **16,407** |

## Features

- `text`: Medical question or answer in Galician (`string`)
- `labels`: Classification label (`int64`)

## Translation pipeline

Translations were produced by a dual-engine pipeline designed to balance speed, fluency, and domain accuracy:

1. **Dual-engine translation**: Each sentence is translated in parallel by two models:
   - [Helsinki-NLP OPUS-MT](https://huggingface.co/Helsinki-NLP), a lightweight, fast neural MT model
   - TranslateGemma 12B, a large language model with stronger contextual understanding
2. **Arbitration**: A second TranslateGemma 12B instance acts as an arbitrator, comparing both outputs and synthesizing a final translation that prioritises natural fluency and adherence to [ILG-RAG](https://ilg.usc.gal/) orthographic standards for Galician.
3. **Quality scoring**: Every translation is scored with [COMETKiwi](https://huggingface.co/Unbabel/wmt22-cometkiwi-da) (reference-free quality estimation). Sentences below the quality threshold are flagged for manual review.

## Source dataset

Based on [MedQuAD](https://github.com/abachaa/MedQuAD) (Medical Question Answering Dataset) by Ben Abacha and Demner-Fushman (2019), covering 37 medical categories sourced from NIH websites.
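The control flow of the translation pipeline described above (two engines in parallel, an arbitration step, then a reference-free quality score with a review flag) can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the function name `translate_with_arbitration`, the `TranslationResult` type, and the `QUALITY_THRESHOLD` value are all hypothetical, and the four callables stand in for OPUS-MT, TranslateGemma 12B, the arbitrator instance, and COMETKiwi, whose actual invocation is not documented here.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

# Hypothetical cutoff: the dataset card does not state the real threshold.
QUALITY_THRESHOLD = 0.80


@dataclass
class TranslationResult:
    source: str
    translation: str
    score: float
    needs_review: bool  # True when the score falls below the threshold


def translate_with_arbitration(
    source: str,
    engine_a: Callable[[str], str],            # stand-in for the fast MT model
    engine_b: Callable[[str], str],            # stand-in for the LLM translator
    arbitrate: Callable[[str, str, str], str], # (source, cand_a, cand_b) -> final
    score: Callable[[str, str], float],        # (source, final) -> quality estimate
    threshold: float = QUALITY_THRESHOLD,
) -> TranslationResult:
    # Step 1: run both engines in parallel on the same source sentence.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(engine_a, source)
        future_b = pool.submit(engine_b, source)
        candidate_a = future_a.result()
        candidate_b = future_b.result()

    # Step 2: let the arbitrator produce one final translation from both candidates.
    final = arbitrate(source, candidate_a, candidate_b)

    # Step 3: score without a reference and flag low-quality output for review.
    quality = score(source, final)
    return TranslationResult(source, final, quality, quality < threshold)
```

Passing the engines, arbitrator, and scorer as plain callables keeps the sketch independent of any particular model-loading code, so each stage can be swapped or mocked when experimenting.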
Provider:
JMasr