serhanayberkkilic/physiotherapy-evidence-qa

Name: serhanayberkkilic/physiotherapy-evidence-qa
Creator: serhanayberkkilic
Published: 2025-11-22 18:33:37
License: CC BY 4.0 (per the dataset card)

Source: Hugging Face. Updated 2025-11-22; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/serhanayberkkilic/physiotherapy-evidence-qa

Description:

---
language:
- tr
- en
license: cc-by-4.0
task_categories:
- question-answering
- text-generation
- translation
tags:
- medical
- rag
- medical-rag
- llm
- fine-tuning
- question-answering
- bilingual
- turkish
pretty_name: Physiotherapy Evidence QA
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: "*.csv"
---

# 🏥 Physiotherapy Evidence QA: A Bilingual Clinical Corpus

**Physiotherapy Evidence QA** is a large-scale, expert-curated bilingual dataset comprising **143,711** aligned question-answer pairs. It focuses on evidence-based physiotherapy, musculoskeletal rehabilitation, outcome measures, and clinical research methodology.

This corpus is designed to facilitate the development of **Medical Large Language Models (Med-LLMs)**, **Clinical Decision Support Systems (CDSS)**, and **Cross-Lingual Information Retrieval** systems specifically for the physical therapy domain.

**Principal Investigators / Curators:**

* **Serhan Ayberk KILIÇ** (Senior AI Engineer)
* **Fatma Betül DERDİYOK** (PhD Student)
* **Kasım SERBEST** (Associate Professor)

---

## 📖 Dataset Description

The dataset provides a structured taxonomy of clinical knowledge across musculoskeletal rehabilitation, physiotherapy interventions, and evidence-based treatment protocols. It bridges the gap between raw clinical literature and structured instruction-tuning data for medical LLM applications.
**Data Generation Methodology:**

- Question-answer pairs were generated using **distillation techniques** from approximately **3,000 scientific sources**.
- Sources include peer-reviewed articles, clinical textbooks, and academic theses.
- All references were carefully selected by **domain experts** in physiotherapy and rehabilitation.
- Each QA pair is traceable to its original source via the `source_file` and `source_page` metadata.

### Key Features

* **Bilingual Alignment:** Every entry contains semantically aligned pairs in **Turkish** and **English**, enabling cross-lingual transfer learning.
* **Granular Metadata:** Each record is tagged with a difficulty level, a question type (e.g., *Diagnosis, Treatment, Etiology*), and a specific disease category.
* **Evidence-Based:** Responses are derived from vetted clinical literature, which supports reliability for medical applications.

---

## 🔬 Usage & Research Applications

This dataset is optimized for three primary LLM tasks in the healthcare domain, described in the numbered subsections that follow.

**Key Use Cases:**

- Training domain-specific medical LLMs
- Building RAG systems for clinical decision support
- Developing bilingual medical AI assistants
- Benchmarking cross-lingual medical understanding

### 1. Domain-Specific Supervised Fine-Tuning (SFT)

Researchers can use this dataset to fine-tune general-purpose LLMs to improve their proficiency in physiotherapy terminology and clinical reasoning.

```python
from datasets import load_dataset

# Load the dataset (the card defines a single "train" split built from the CSV files)
dataset = load_dataset("serhanayberkkilic/physiotherapy-evidence-qa")


# Example: formatting data for instruction tuning (SFT)
def format_instruction(sample):
    # Build one English prompt/response string and keep the disease category as a label
    return {
        "prompt": f"User: {sample['question_en']}\n\nAssistant: {sample['answer_en']}",
        "category": sample["disease_category_en"],
    }


# map() adds the "prompt" and "category" columns to every row
formatted_data = dataset["train"].map(format_instruction)
print(formatted_data[0]["prompt"])
```

### 2. Retrieval-Augmented Generation (RAG) Benchmarking

The dataset serves as a gold-standard knowledge base for testing RAG pipelines. The `source_file` and `source_page` columns allow researchers to verify citation accuracy and retrieval precision.

### 3. Cross-Lingual Medical Alignment

Given the parallel nature of the data (`question_tr` vs. `question_en`), this corpus is suitable for training translation models specialized in medical syntax or for evaluating the zero-shot performance of English-centric models on Turkish medical queries.

---

## 📊 Data Schema

The dataset is distributed in **CSV** format. Every metadata field is provided in both Turkish and English:

| Column Name | Type | Description |
|:---|:---|:---|
| question_tr | String | Clinical inquiry in **Turkish**. |
| answer_tr | String | Evidence-based response in **Turkish**. |
| question_en | String | Clinical inquiry in **English** (semantic translation). |
| answer_en | String | Evidence-based response in **English**. |
| disease_category_tr | Categorical | The specific pathology or topic in Turkish. |
| disease_category_en | Categorical | The specific pathology or topic in English (e.g., *Lateral Epicondylitis*). |
| question_type_tr | Categorical | Classification of the query in Turkish. |
| question_type_en | Categorical | Classification of the query in English (e.g., *Anatomy, Diagnosis*). |
| difficulty_tr | Ordinal | Difficulty level in Turkish (*Kolay, Orta, Zor*). |
| difficulty_en | Ordinal | Difficulty level in English (*Easy, Medium, Hard*). |
| keywords_tr | List | Domain-specific terminology in Turkish. |
| keywords_en | List | Domain-specific terminology in English. |
| source_file | String | The original academic reference document. |
| source_page | Integer | Page index in the source document for citation verification. |

---

## 📚 Citation

Use this BibTeX to cite the repository until the research paper is published:

```bibtex
@misc{kilic2025physiotherapyqa,
  author = {Kılıç, Serhan Ayberk and Derdiyok, Fatma Betül and Serbest, Kasım},
  title = {Physiotherapy Evidence QA: A Bilingual Clinical Corpus},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Repository},
  howpublished = {\url{https://huggingface.co/datasets/serhanayberkkilic/physiotherapy-evidence-qa}},
  note = {dataset}
}
```

---

## ⚖️ License & Disclaimer

**License:** This dataset is licensed under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license.

**Disclaimer:** While this dataset is curated by professionals, it is intended for **research and educational purposes only**. It should not be used as a substitute for professional medical advice, diagnosis, or treatment. AI models trained on this data should undergo rigorous safety testing before clinical deployment.
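The citation check mentioned under RAG benchmarking can be sketched in a few lines. This is a minimal illustration, not the curators' evaluation protocol: it assumes your retriever returns, for each question, a ranked list of `(source_file, source_page)` pairs, and it scores a question as a hit when the row's own `source_file` and `source_page` appear in the top `k`. The helper names `gold_citation`, `hit_at_k`, and `citation_hit_rate` are hypothetical.

```python
from typing import Any, Mapping, Sequence, Tuple

# A citation is identified by the dataset's source_file and source_page columns.
Citation = Tuple[str, int]


def gold_citation(record: Mapping[str, Any]) -> Citation:
    """Extract the (source_file, source_page) pair from one dataset row."""
    return (str(record["source_file"]), int(record["source_page"]))


def hit_at_k(gold: Citation, retrieved: Sequence[Citation], k: int) -> bool:
    """Return True if the gold citation is among the top-k retrieved citations."""
    return gold in retrieved[:k]


def citation_hit_rate(
    records: Sequence[Mapping[str, Any]],
    retrieved_lists: Sequence[Sequence[Citation]],
    k: int = 5,
) -> float:
    """Fraction of questions whose gold citation appears in the top-k results.

    `retrieved_lists[i]` is the ranked retriever output for `records[i]`;
    this pairing is an assumption about how the pipeline is wired.
    """
    if len(records) != len(retrieved_lists):
        raise ValueError("records and retrieved_lists must have the same length")
    if not records:
        return 0.0
    hits = sum(
        hit_at_k(gold_citation(rec), ranked, k)
        for rec, ranked in zip(records, retrieved_lists)
    )
    return hits / len(records)
```

Rows from `dataset["train"]` can be passed in as `records` directly, since each row behaves like a mapping with `source_file` and `source_page` keys. Matching on the exact page is strict; a looser variant could compare `source_file` only.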

Provider:

serhanayberkkilic
