ThaiLLM/med-app-instruct
Source: Hugging Face | Updated: 2026-03-25 | Indexed: 2026-04-26
Download link: https://hf-mirror.com/datasets/ThaiLLM/med-app-instruct
Resource description:
---
dataset_info:
  features:
  - name: conversations
    dtype: string
  - name: tool_name
    dtype: string
  splits:
  - name: qwen3_5_27b
    num_bytes: 2420420878
    num_examples: 376439
  - name: qwen3_5_plus
    num_bytes: 2150192481
    num_examples: 376477
  - name: claude_4_6_sonnet
    num_bytes: 2147726822
    num_examples: 376477
  download_size: 5369403541
  dataset_size: 8942688074
configs:
- config_name: default
  data_files:
  - split: qwen3_5_27b
    path: data/qwen3_5_27b-*
  - split: qwen3_5_plus
    path: data/qwen3_5_plus-*
  - split: claude_4_6_sonnet
    path: data/claude_4_6_sonnet-*
license: mit
task_categories:
- text-generation
language:
- th
- en
pretty_name: med-app-instruct
size_categories:
- 100K<n<1M
---
# ThaiLLM Medical Instruction with Tool Calling
A synthetic Thai medical instruction-following dataset with tool-calling capabilities, designed for training language models to handle healthcare-related queries through a mobile health assistant interface.
## Dataset Description
This dataset contains multi-turn conversations between users and an AI health assistant, featuring both direct responses and tool-augmented interactions. The conversations simulate a realistic Thai healthcare application scenario where the assistant can invoke various medical tools to provide accurate, contextual assistance.
### Dataset Structure
Each example follows the OpenAI chat completion format and is compatible with Hugging Face's [SFTTrainer](https://huggingface.co/docs/trl/en/sft_trainer) for fine-tuning.
```python
{
    "conversations": [
        {"role": "system", "content": "...system prompt with tool definitions..."},
        {"role": "user", "content": "...user query in Thai..."},
        {"role": "assistant", "tool_calls": [...]},  # Tool invocation
        {"role": "tool", "name": "tool_name", "content": "...tool results..."},
        {"role": "assistant", "content": "...final response..."}
    ],
    "tool_name": "..."  # The primary tool used in this conversation
}
```
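The metadata declares `conversations` with dtype `string`, which suggests the message list may be stored as serialized text rather than as a nested list. The sketch below assumes JSON encoding (the card does not state the serialization) and uses two hypothetical helpers, `parse_conversations` and `roles`, to recover the messages from one record. The toy record contains placeholder text only.

```python
import json


def parse_conversations(record):
    """Return the message list for one record.

    Assumption: when `conversations` is a string it is JSON-encoded;
    a value that is already a list is passed through unchanged.
    """
    raw = record["conversations"]
    return json.loads(raw) if isinstance(raw, str) else raw


def roles(messages):
    """List the role of each message, in order."""
    return [message["role"] for message in messages]


# Toy record mirroring the layout above (contents are placeholders).
example = {
    "conversations": json.dumps([
        {"role": "system", "content": "system prompt"},
        {"role": "user", "content": "user query"},
        {"role": "assistant", "content": "final response"},
    ]),
    "tool_name": "negatives",
}

print(roles(parse_conversations(example)))
```

If the column turns out to use a different encoding, only `parse_conversations` needs to change.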
### Data Splits
| Split | Description | Response Mining Model |
|:------|:------------|:----------------------|
| `qwen3_5_27b` | Responses mined from Qwen3-235B-A22B (22B active params) | [Qwen/Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) |
| `qwen3_5_plus` | Responses mined from Qwen3.5-Plus | Qwen3.5-Plus (via OpenRouter) |
| `claude_4_6_sonnet` | Responses mined from Claude Sonnet 4.6 | [Claude Sonnet 4.6](https://docs.anthropic.com/en/docs/about-claude/models) |
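Each split can be loaded on its own, which avoids downloading responses from models you do not need. A minimal sketch, assuming the `datasets` library is installed and the repository is reachable on the Hugging Face Hub; `SPLITS` and `load_split` are hypothetical names introduced here.

```python
SPLITS = ("qwen3_5_27b", "qwen3_5_plus", "claude_4_6_sonnet")


def load_split(split):
    """Load one split of ThaiLLM/med-app-instruct from the Hugging Face Hub."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("ThaiLLM/med-app-instruct", split=split)
```

Calling `load_split("claude_4_6_sonnet")` triggers a network download, so the function validates the split name before importing anything heavy.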
## Tools
The dataset includes interactions with 7 healthcare-related tools:
| Tool Name | Description | Response Format |
|:----------|:------------|:----------------|
| `search_medical_facts` | Retrieves relevant medical facts from a knowledge base to answer health-related questions | Structured response with `<response>` and `<reference>` tags containing citations |
| `prescreen` | Initiates a symptom severity assessment pipeline with differential diagnosis | Recommendation based on severity classification |
| `get_health_emergency_contact` | Returns Thailand emergency health hotlines (ambulance, poison control, mental health) | List of relevant emergency contacts |
| `create_appointment` | Creates a new appointment with a hospital/clinic | Confirmation of appointment details |
| `create_reminder` | Creates a medication reminder | Confirmation of reminder setup |
| `list_appointment` | Retrieves and allows interaction with existing appointments | List of appointments or confirmation of edits |
| `list_reminder` | Retrieves and allows interaction with existing medication reminders | List of reminders or confirmation of edits |
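Responses that follow a `search_medical_facts` call wrap the answer and its citations in `<response>` and `<reference>` tags. The sketch below shows one way to separate the two parts. It assumes each tag is closed with a matching `</...>` tag, which the card does not spell out; `extract_tag` is a hypothetical helper and the sample string is an invented placeholder, not dataset content.

```python
import re


def extract_tag(text, tag):
    """Return the stripped contents of every <tag>...</tag> pair in text."""
    name = re.escape(tag)
    pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
    return [match.strip() for match in pattern.findall(text)]


# Placeholder text in the assumed tag layout.
sample = (
    "<response>Drink fluids and rest.</response>\n"
    "<reference>[1] placeholder source</reference>"
)

answers = extract_tag(sample, "response")
citations = extract_tag(sample, "reference")
```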
### Tool Categories
- **Informational Queries (IQ):** `search_medical_facts` - Medical RAG with citation requirements
- **Health Assessment:** `prescreen` - Symptom severity classification
- **Emergency Services:** `get_health_emergency_contact` - Thailand-specific emergency hotlines
- **Scheduling & Management:** `create_appointment`, `create_reminder`, `list_appointment`, `list_reminder`
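The grouping above can be written as a lookup table, which is handy for reporting per-category metrics during evaluation. This is only a sketch: `TOOL_CATEGORIES` and `category_of` are hypothetical names, and the `"No tool"` fallback for unlisted names is an assumption.

```python
TOOL_CATEGORIES = {
    "search_medical_facts": "Informational Queries (IQ)",
    "prescreen": "Health Assessment",
    "get_health_emergency_contact": "Emergency Services",
    "create_appointment": "Scheduling & Management",
    "create_reminder": "Scheduling & Management",
    "list_appointment": "Scheduling & Management",
    "list_reminder": "Scheduling & Management",
}


def category_of(tool_name):
    """Map a tool name to its category; unlisted names fall back to 'No tool'."""
    return TOOL_CATEGORIES.get(tool_name, "No tool")
```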
## Data Generation Pipeline
### Source Data
The dataset is constructed from multiple sources:
1. **Medical Facts:** Retrieved from [ThaiLLM/med-facts](https://huggingface.co/datasets/ThaiLLM/med-facts) and [ThaiLLM/med-articles](https://huggingface.co/datasets/ThaiLLM/med-articles)
2. **Medical Q&A:** Based on [ThaiLLM/med-qas-synthetic](https://huggingface.co/datasets/ThaiLLM/med-qas-synthetic) (refined baseline split)
3. **Synthetic Tool Queries:** Generated for appointment, reminder, prescreen, and emergency contact scenarios
4. **Negative Samples:** Sourced from [kunato/typhoon-s-instruct-post-training](https://huggingface.co/datasets/kunato/typhoon-s-instruct-post-training) for non-tool conversations
### Generation Process
1. **Query Synthesis:** User queries are synthetically generated based on predefined scenarios covering various medical and scheduling use cases
2. **Tool Mocking:** Tool responses are simulated with realistic data (appointments, reminders, medical facts, prescreen results)
3. **Response Mining:** Final assistant responses are mined from a large language model given the full conversation context
4. **Format Conversion:** Conversations are converted to SFTTrainer-compatible format
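The final conversion step might look roughly like the sketch below, which assembles one record from a synthesized query, an optional mocked tool exchange, and a mined response. It is an illustration, not the authors' pipeline: `to_sft_record` is a hypothetical function, the `tool_calls` payload is passed in by the caller because the card does not show its contents, JSON serialization of `conversations` is assumed from the `string` dtype, and the `"negatives"` label for non-tool records is inferred from the distribution table.

```python
import json


def to_sft_record(system_prompt, user_query, final_response,
                  tool_name=None, tool_calls=None, tool_result=None):
    """Assemble one record in the conversations/tool_name layout."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ]
    if tool_name is not None:
        # Tool path: the assistant invokes the tool, then the mocked result follows.
        messages.append({"role": "assistant", "tool_calls": tool_calls or []})
        messages.append({"role": "tool", "name": tool_name, "content": tool_result or ""})
    # The mined final response always closes the conversation.
    messages.append({"role": "assistant", "content": final_response})
    return {
        # ensure_ascii=False keeps Thai text readable in the serialized string.
        "conversations": json.dumps(messages, ensure_ascii=False),
        "tool_name": tool_name or "negatives",
    }
```

A record built without `tool_name` has three messages; one built with a tool has five, matching the structure shown earlier in this card.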
## Intended Use
### Primary Use Cases
- Fine-tuning LLMs for Thai medical chatbot applications
- Training models to properly invoke and respond to tool calls
- Building healthcare virtual assistants with scheduling capabilities
- Research on medical information retrieval with citations
### Out-of-Scope Use
- This dataset should **NOT** be used for actual medical diagnosis
- Not suitable for providing real medical advice without human oversight
- The emergency contact information is specific to Thailand and may not apply to other regions
## Dataset Statistics
| Split | Samples |
|:------|--------:|
| `qwen3_5_27b` | 376,439 |
| `qwen3_5_plus` | 376,477 |
| `claude_4_6_sonnet` | 376,477 |
### Distribution by Tool (per split, approximate)
| Tool Name | Samples | Percentage |
|:----------|--------:|-----------:|
| `negatives` (no tool call) | 357,072 | 94.85% |
| `search_medical_facts` | 14,126 | 3.75% |
| `get_health_emergency_contact` | 1,106 | 0.29% |
| `create_appointment` | 1,000 | 0.27% |
| `create_reminder` | 1,000 | 0.27% |
| `list_reminder` | 778 | 0.21% |
| `list_appointment` | 773 | 0.21% |
| `prescreen` | 622 | 0.17% |
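The percentages in the table can be reproduced from the counts, and the same arithmetic shows how few samples involve a tool call. The counts sum to 376,477, which matches the two larger splits. The variable names below are illustrative.

```python
COUNTS = {
    "negatives": 357_072,
    "search_medical_facts": 14_126,
    "get_health_emergency_contact": 1_106,
    "create_appointment": 1_000,
    "create_reminder": 1_000,
    "list_reminder": 778,
    "list_appointment": 773,
    "prescreen": 622,
}

total = sum(COUNTS.values())

# Share of each class, in percent, rounded as in the table.
percentages = {name: round(100 * n / total, 2) for name, n in COUNTS.items()}

# Combined share of all samples that contain a tool call.
tool_share = round(100 * (total - COUNTS["negatives"]) / total, 2)

print(total, percentages["negatives"], tool_share)  # 376477 94.85 5.15
```

Because only about 5% of samples contain a tool call, you may want to rebalance or reweight the classes when fine-tuning for tool use.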
## Limitations and Biases
1. **Synthetic Nature:** Responses are generated by LLMs and may contain hallucinations or inaccuracies
2. **Thailand-Specific:** Emergency contacts and some medical practices are specific to Thailand's healthcare system
3. **Language Bias:** Primarily designed for the Thai language; English support is secondary
4. **Medical Disclaimer:** This is synthetic training data and should not be used for actual medical decisions
5. **Tool Simulation:** Tool outputs are mocked/simulated and do not represent real medical data
## Related Datasets
- [ThaiLLM/med-articles](https://huggingface.co/datasets/ThaiLLM/med-articles) - Source medical articles
- [ThaiLLM/med-facts](https://huggingface.co/datasets/ThaiLLM/med-facts) - Extracted medical facts
- [ThaiLLM/med-qas-synthetic](https://huggingface.co/datasets/ThaiLLM/med-qas-synthetic) - Medical Q&A pairs
- [ThaiLLM/med-qas-golden-articles](https://huggingface.co/datasets/ThaiLLM/med-qas-golden-articles) - Human-annotated gold-label data
Provider: ThaiLLM



