serhanayberkkilic/physiotherapy-evidence-qa
Source: Hugging Face | Updated: 2025-11-22 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/serhanayberkkilic/physiotherapy-evidence-qa
Resource description:
---
language:
- tr
- en
license: cc-by-4.0
task_categories:
- question-answering
- text-generation
- translation
tags:
- medical
- rag
- medical-rag
- llm
- fine-tuning
- question-answering
- bilingual
- turkish
pretty_name: Physiotherapy Evidence QA
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: "*.csv"
---
# 🏥 Physiotherapy Evidence QA: A Bilingual Clinical Corpus
**Physiotherapy Evidence QA** is a large-scale, expert-curated bilingual dataset comprising **143,711** aligned question-answer pairs. It focuses on evidence-based physiotherapy, musculoskeletal rehabilitation, outcome measures, and clinical research methodology.
This corpus is designed to facilitate the development of **Medical Large Language Models (Med-LLMs)**, **Clinical Decision Support Systems (CDSS)**, and **Cross-Lingual Information Retrieval** systems specifically for the physical therapy domain.
**Principal Investigators / Curators:**
* **Serhan Ayberk KILIÇ** (Senior AI Engineer)
* **Fatma Betül DERDİYOK** (PhD Student)
* **Kasım SERBEST** (Associate Professor)
---
## 📖 Dataset Description
The dataset provides a structured taxonomy of clinical knowledge across musculoskeletal rehabilitation, physiotherapy interventions, and evidence-based treatment protocols. It bridges the gap between raw clinical literature and structured instruction-tuning data for medical LLM applications.
**Data Generation Methodology:**
- Question-answer pairs were generated using **distillation techniques** from approximately **3,000 scientific sources**
- Sources include peer-reviewed articles, clinical textbooks, and academic theses
- All references were carefully selected by **domain experts** in physiotherapy and rehabilitation
- Each QA pair is traceable to its original source via `source_file` and `source_page` metadata
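
As a minimal sketch of how the `source_file` and `source_page` metadata might support traceability, the helper below groups questions by the page they were derived from. The function name `group_by_source` and the sample records are hypothetical and invented for illustration; real rows would come from the dataset's `train` split.

```python
from collections import defaultdict


def group_by_source(records):
    """Map (source_file, source_page) to the English questions drawn from that page."""
    index = defaultdict(list)
    for rec in records:
        key = (rec["source_file"], int(rec["source_page"]))
        index[key].append(rec["question_en"])
    return dict(index)


# Hypothetical records: file names and questions are placeholders, not dataset rows.
sample_records = [
    {"question_en": "Example question A", "source_file": "example_a.pdf", "source_page": 12},
    {"question_en": "Example question B", "source_file": "example_a.pdf", "source_page": 12},
    {"question_en": "Example question C", "source_file": "example_b.pdf", "source_page": 3},
]

provenance = group_by_source(sample_records)
for (source_file, source_page), questions in provenance.items():
    print(f"{source_file} p.{source_page}: {len(questions)} question(s)")
```

Keying on the `(source_file, source_page)` pair rather than the file alone keeps the grouping at the same granularity the metadata provides.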
### Key Features
* **Bilingual Alignment:** Every entry contains semantically aligned pairs in **Turkish** and **English**, enabling cross-lingual transfer learning.
* **Granular Metadata:** Each record is tagged with difficulty levels, question types (e.g., *Diagnosis, Treatment, Etiology*), and specific disease categories.
* **Evidence-Based:** Responses are derived from vetted clinical literature, ensuring high reliability for medical applications.
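
The granular metadata lends itself to slicing the corpus into targeted subsets. The sketch below assumes the column names listed in the data schema and assumes that labels are stored exactly as strings such as `Hard` or `Diagnosis`; the helper `filter_records` and the sample rows are hypothetical.

```python
def filter_records(records, **criteria):
    """Keep records whose columns equal every requested value."""
    return [
        rec for rec in records
        if all(rec.get(column) == value for column, value in criteria.items())
    ]


# Hypothetical rows: the label spellings are assumptions, not verified dataset values.
sample_records = [
    {"question_en": "Example question A", "difficulty_en": "Hard",
     "question_type_en": "Diagnosis", "disease_category_en": "Lateral Epicondylitis"},
    {"question_en": "Example question B", "difficulty_en": "Easy",
     "question_type_en": "Treatment", "disease_category_en": "Lateral Epicondylitis"},
    {"question_en": "Example question C", "difficulty_en": "Hard",
     "question_type_en": "Treatment", "disease_category_en": "Placeholder Category"},
]

hard_diagnosis = filter_records(
    sample_records, difficulty_en="Hard", question_type_en="Diagnosis"
)
epicondylitis = filter_records(
    sample_records, disease_category_en="Lateral Epicondylitis"
)
print(len(hard_diagnosis), len(epicondylitis))
```

It is worth inspecting the distinct values of each categorical column before filtering, since the actual label spellings may differ from the ones assumed here.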
---
## 🔬 Usage & Research Applications
This dataset targets three primary LLM tasks in the healthcare domain, described in the numbered subsections below.
**Key Use Cases:**
- Training domain-specific medical LLMs
- Building RAG systems for clinical decision support
- Developing bilingual medical AI assistants
- Benchmarking cross-lingual medical understanding
### 1. Domain-Specific Supervised Fine-Tuning (SFT)
Researchers can use this dataset to fine-tune general-purpose LLMs to improve their proficiency in physiotherapy terminology and clinical reasoning.
```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("serhanayberkkilic/physiotherapy-evidence-qa")


# Example: Formatting data for Instruction Tuning (SFT)
def format_instruction(sample):
    return {
        "prompt": f"User: {sample['question_en']}\n\nAssistant: {sample['answer_en']}",
        "category": sample['disease_category_en'],
    }


formatted_data = dataset['train'].map(format_instruction)
print(formatted_data[0]['prompt'])
```
### 2. Retrieval-Augmented Generation (RAG) Benchmarking
The dataset serves as a gold-standard knowledge base for testing RAG pipelines. The `source_file` and `source_page` columns allow researchers to verify citation accuracy and retrieval precision.
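
One possible way to score retrieval against the source columns is a hit rate at rank k: the fraction of questions whose gold `(source_file, source_page)` pair appears among the top-k retrieved passages. This is a sketch under assumptions, not the authors' evaluation protocol; `hit_rate_at_k` is a hypothetical helper, and the gold and retrieved lists below are invented stand-ins for a real retriever's output.

```python
def hit_rate_at_k(gold_sources, retrieved_sources, k=3):
    """Fraction of queries whose gold (file, page) pair is in the top-k results."""
    if len(gold_sources) != len(retrieved_sources):
        raise ValueError("gold_sources and retrieved_sources must have equal length")
    if not gold_sources:
        return 0.0
    hits = sum(
        1
        for gold, ranked in zip(gold_sources, retrieved_sources)
        if gold in ranked[:k]
    )
    return hits / len(gold_sources)


# Hypothetical gold references, one per question (file names are placeholders).
gold = [("a.pdf", 12), ("b.pdf", 3), ("c.pdf", 7)]

# Hypothetical ranked retriever output for the same three questions.
retrieved = [
    [("a.pdf", 12), ("a.pdf", 13)],  # gold at rank 1
    [("x.pdf", 1), ("b.pdf", 3)],    # gold at rank 2
    [("c.pdf", 8), ("d.pdf", 7)],    # gold missing (wrong page, wrong file)
]

print(hit_rate_at_k(gold, retrieved, k=1))
print(hit_rate_at_k(gold, retrieved, k=2))
```

Matching on the exact page is strict; a pipeline that chunks across page boundaries might reasonably relax this to a match on `source_file` alone.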
### 3. Cross-Lingual Medical Alignment
Given the parallel nature of the data (`question_tr` vs. `question_en`), this corpus is suitable for training translation models specialized in medical syntax or evaluating the zero-shot performance of English-centric models on Turkish medical queries.
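
The parallel columns could be flattened into source/target pairs for translation training along the following lines. The helper `to_translation_pairs`, the output keys `source`, `target`, and `field`, and the sample record are all hypothetical; the record text is an invented illustration, not a row from the dataset.

```python
def to_translation_pairs(records, source_lang="tr", target_lang="en",
                         fields=("question", "answer")):
    """Build source/target text pairs from the parallel *_tr / *_en columns."""
    pairs = []
    for rec in records:
        for field in fields:
            source_text = rec.get(f"{field}_{source_lang}")
            target_text = rec.get(f"{field}_{target_lang}")
            # Skip a pair when either side is missing or empty.
            if source_text and target_text:
                pairs.append(
                    {"source": source_text, "target": target_text, "field": field}
                )
    return pairs


# Hypothetical record with an empty Turkish answer to show the skip behaviour.
sample_records = [
    {
        "question_tr": "Lateral epikondilit nedir?",
        "question_en": "What is lateral epicondylitis?",
        "answer_tr": "",
        "answer_en": "Placeholder English answer.",
    }
]

tr_to_en = to_translation_pairs(sample_records)
en_to_tr = to_translation_pairs(sample_records, source_lang="en", target_lang="tr")
print(tr_to_en)
```

Swapping `source_lang` and `target_lang` yields the reverse direction from the same rows, so one pass over the corpus can serve both translation directions.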
---
## 📊 Data Schema
The dataset is distributed in **CSV** format. Every content and metadata column except the source references has a Turkish (`_tr`) and an English (`_en`) variant:
| Column Name | Type | Description |
|:---|:---|:---|
| question_tr | String | Clinical inquiry in **Turkish**. |
| answer_tr | String | Evidence-based response in **Turkish**. |
| question_en | String | Clinical inquiry in **English** (Semantic translation). |
| answer_en | String | Evidence-based response in **English**. |
| disease_category_tr | Categorical | The specific pathology or topic in Turkish. |
| disease_category_en | Categorical | The specific pathology or topic in English (e.g., *Lateral Epicondylitis*). |
| question_type_tr | Categorical | Classification of the query in Turkish. |
| question_type_en | Categorical | Classification of the query in English (e.g., *Anatomy, Diagnosis*). |
| difficulty_tr | Ordinal | Difficulty level in Turkish (*Kolay, Orta, Zor*). |
| difficulty_en | Ordinal | Difficulty level in English (*Easy, Medium, Hard*). |
| keywords_tr | List | Domain-specific terminology in Turkish. |
| keywords_en | List | Domain-specific terminology in English. |
| source_file | String | The original academic reference document. |
| source_page | Integer | Page index in the source document for citation verification. |
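
Because CSV has no native list type, the `keywords_tr` and `keywords_en` cells will arrive as strings and need parsing. The sketch below checks a row against the columns in the table above and parses a keyword cell. How the lists are serialized is not stated in this card, so the parser assumes either a Python-style list literal or a comma-separated string; `missing_columns` and `parse_keywords` are hypothetical helpers.

```python
import ast

# Column names taken from the schema table above.
EXPECTED_COLUMNS = [
    "question_tr", "answer_tr", "question_en", "answer_en",
    "disease_category_tr", "disease_category_en",
    "question_type_tr", "question_type_en",
    "difficulty_tr", "difficulty_en",
    "keywords_tr", "keywords_en",
    "source_file", "source_page",
]


def missing_columns(record):
    """Return the schema columns that are absent from a row (a dict)."""
    return [column for column in EXPECTED_COLUMNS if column not in record]


def parse_keywords(raw):
    """Turn a keyword cell into a list of strings.

    Assumption: the cell is a list literal such as "['a', 'b']" or a
    comma-separated string such as "a, b".
    """
    if isinstance(raw, (list, tuple)):
        return [str(item).strip() for item in raw]
    text = (raw or "").strip()
    if text.startswith("["):
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            parsed = None
        if isinstance(parsed, (list, tuple)):
            return [str(item).strip() for item in parsed]
    # Fallback: treat the cell as comma-separated text.
    return [part.strip() for part in text.strip("[]").split(",") if part.strip()]


print(parse_keywords("['tendon', 'elbow']"))
print(parse_keywords("tendon, elbow"))
print(missing_columns({"question_tr": "..."}))
```

`ast.literal_eval` is used instead of `eval` so that a malformed or unexpected cell cannot execute code.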
---
## 📚 Citation
Use this BibTeX to cite the repository until the research paper is published:
```bibtex
@misc{kilic2025physiotherapyqa,
  author = {Kılıç, Serhan Ayberk and Derdiyok, Fatma Betül and Serbest, Kasım},
  title = {Physiotherapy Evidence QA: A Bilingual Clinical Corpus},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Repository},
  howpublished = {\url{https://huggingface.co/datasets/serhanayberkkilic/physiotherapy-evidence-qa}},
  note = {dataset}
}
```
---
## ⚖️ License & Disclaimer
**License:** This dataset is licensed under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license.
**Disclaimer:** While this dataset is curated by professionals, it is intended for **research and educational purposes only**. It should not be used as a substitute for professional medical advice, diagnosis, or treatment. AI models trained on this data should undergo rigorous safety testing before clinical deployment.
Provider:
serhanayberkkilic



