
ThaiLLM/med-app-instruct

Source: Hugging Face. Updated 2026-03-25; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/ThaiLLM/med-app-instruct
Resource description:
---
dataset_info:
  features:
  - name: conversations
    dtype: string
  - name: tool_name
    dtype: string
  splits:
  - name: qwen3_5_27b
    num_bytes: 2420420878
    num_examples: 376439
  - name: qwen3_5_plus
    num_bytes: 2150192481
    num_examples: 376477
  - name: claude_4_6_sonnet
    num_bytes: 2147726822
    num_examples: 376477
  download_size: 5369403541
  dataset_size: 8942688074
configs:
- config_name: default
  data_files:
  - split: qwen3_5_27b
    path: data/qwen3_5_27b-*
  - split: qwen3_5_plus
    path: data/qwen3_5_plus-*
  - split: claude_4_6_sonnet
    path: data/claude_4_6_sonnet-*
license: mit
task_categories:
- text-generation
language:
- th
- en
pretty_name: med-app-instruct
size_categories:
- 100K<n<1M
---

# ThaiLLM Medical Instruction with Tool Calling

A synthetic Thai medical instruction-following dataset with tool-calling capabilities, designed for training language models to handle healthcare-related queries through a mobile health assistant interface.

## Dataset Description

This dataset contains multi-turn conversations between users and an AI health assistant, featuring both direct responses and tool-augmented interactions. The conversations simulate a realistic Thai healthcare application in which the assistant can invoke medical tools to provide accurate, contextual assistance.

### Dataset Structure

Each example follows the OpenAI chat completion format and is compatible with Hugging Face's [SFTTrainer](https://huggingface.co/docs/trl/en/sft_trainer) for fine-tuning.

```python
{
    "conversations": [
        {"role": "system", "content": "...system prompt with tool definitions..."},
        {"role": "user", "content": "...user query in Thai..."},
        {"role": "assistant", "tool_calls": [...]},  # Tool invocation
        {"role": "tool", "name": "tool_name", "content": "...tool results..."},
        {"role": "assistant", "content": "...final response..."}
    ],
    "tool_name": "..."  # The primary tool used in this conversation
}
```

### Data Splits

| Split | Description | Response Mining Model |
|:------|:------------|:----------------------|
| `qwen3_5_27b` | Responses mined from Qwen3-235B-A22B (22B active params) | [Qwen/Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) |
| `qwen3_5_plus` | Responses mined from Qwen3.5-Plus | Qwen3.5-Plus (via OpenRouter) |
| `claude_4_6_sonnet` | Responses mined from Claude Sonnet 4.6 | [Claude Sonnet 4.6](https://docs.anthropic.com/en/docs/about-claude/models) |

## Tools

The dataset includes interactions with 7 healthcare-related tools:

| Tool Name | Description | Response Format |
|:----------|:------------|:----------------|
| `search_medical_facts` | Retrieves relevant medical facts from a knowledge base to answer health-related questions | Structured response with `<response>` and `<reference>` tags containing citations |
| `prescreen` | Initiates a symptom severity assessment pipeline with differential diagnosis | Recommendation based on severity classification |
| `get_health_emergency_contact` | Returns Thailand emergency health hotlines (ambulance, poison control, mental health) | List of relevant emergency contacts |
| `create_appointment` | Creates a new appointment with a hospital/clinic | Confirmation of appointment details |
| `create_reminder` | Creates a medication reminder | Confirmation of reminder setup |
| `list_appointment` | Retrieves existing appointments and allows them to be edited | List of appointments or confirmation of edits |
| `list_reminder` | Retrieves existing medication reminders and allows them to be edited | List of reminders or confirmation of edits |

### Tool Categories

- **Informational Queries (IQ):** `search_medical_facts` - Medical RAG with citation requirements
- **Health Assessment:** `prescreen` - Symptom severity classification
- **Emergency Services:** `get_health_emergency_contact` - Thailand-specific emergency hotlines
- **Scheduling & Management:** `create_appointment`, `create_reminder`, `list_appointment`, `list_reminder`

## Data Generation Pipeline

### Source Data

The dataset is constructed from multiple sources:

1. **Medical Facts:** Retrieved from [ThaiLLM/med-facts](https://huggingface.co/datasets/ThaiLLM/med-facts) and [ThaiLLM/med-articles](https://huggingface.co/datasets/ThaiLLM/med-articles)
2. **Medical Q&A:** Based on [ThaiLLM/med-qas-synthetic](https://huggingface.co/datasets/ThaiLLM/med-qas-synthetic) (refined baseline split)
3. **Synthetic Tool Queries:** Generated for appointment, reminder, prescreen, and emergency contact scenarios
4. **Negative Samples:** Sourced from [kunato/typhoon-s-instruct-post-training](https://huggingface.co/datasets/kunato/typhoon-s-instruct-post-training) for non-tool conversations

### Generation Process

1. **Query Synthesis:** User queries are synthetically generated from predefined scenarios covering medical and scheduling use cases
2. **Tool Mocking:** Tool responses are simulated with realistic data (appointments, reminders, medical facts, prescreen results)
3. **Response Mining:** Final assistant responses are mined from a large language model given the full conversation context
4. **Format Conversion:** Conversations are converted to an SFTTrainer-compatible format

## Intended Use

### Primary Use Cases

- Fine-tuning LLMs for Thai medical chatbot applications
- Training models to invoke tools correctly and respond to tool results
- Building healthcare virtual assistants with scheduling capabilities
- Research on medical information retrieval with citations

### Out-of-Scope Use

- This dataset should **NOT** be used for actual medical diagnosis
- Not suitable for providing real medical advice without human oversight
- The emergency contact information is specific to Thailand and may not apply to other regions

## Dataset Statistics

| Split | Samples |
|:------|--------:|
| `qwen3_5_27b` | 376,439 |
| `qwen3_5_plus` | 376,477 |
| `claude_4_6_sonnet` | 376,477 |

### Distribution by Tool (per split, approximate)

| Tool Name | Samples | Percentage |
|:----------|--------:|-----------:|
| `negatives` (no tool call) | 357,072 | 94.85% |
| `search_medical_facts` | 14,126 | 3.75% |
| `get_health_emergency_contact` | 1,106 | 0.29% |
| `create_appointment` | 1,000 | 0.27% |
| `create_reminder` | 1,000 | 0.27% |
| `list_reminder` | 778 | 0.21% |
| `list_appointment` | 773 | 0.21% |
| `prescreen` | 622 | 0.17% |

## Limitations and Biases

1. **Synthetic Nature:** Responses are generated by LLMs and may contain hallucinations or inaccuracies
2. **Thailand-Specific:** Emergency contacts and some medical practices are specific to Thailand's healthcare system
3. **Language Bias:** Primarily designed for Thai language; English support is secondary
4. **Medical Disclaimer:** This is synthetic training data and should not be used for actual medical decisions
5. **Tool Simulation:** Tool outputs are mocked/simulated and do not represent real medical data

## Related Datasets

- [ThaiLLM/med-articles](https://huggingface.co/datasets/ThaiLLM/med-articles) - Source medical articles
- [ThaiLLM/med-facts](https://huggingface.co/datasets/ThaiLLM/med-facts) - Extracted medical facts
- [ThaiLLM/med-qas-synthetic](https://huggingface.co/datasets/ThaiLLM/med-qas-synthetic) - Medical Q&A pairs
- [ThaiLLM/med-qas-golden-articles](https://huggingface.co/datasets/ThaiLLM/med-qas-golden-articles) - Human-annotated gold-label data
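
As a rough illustration of the record layout described under Dataset Structure, the sketch below parses rows, checks whether a conversation involves a tool, and tallies rows by `tool_name` in the style of the distribution table. Several things here are assumptions rather than facts from the card: because the metadata declares `conversations` with dtype `string`, the sketch assumes each value is a JSON-encoded message list; the `tool_calls` payload shape, the literal `negatives` label in the `tool_name` column, and the helper names `parse_messages`, `uses_tool`, and `tool_distribution` are all hypothetical. The sample rows are invented placeholders, not dataset content.

```python
import json
from collections import Counter

# Hypothetical rows mirroring the card's schema (NOT real dataset content).
# ASSUMPTION: `conversations` is a JSON-encoded list of messages, since the
# metadata declares it as dtype string. The tool_calls shape is also assumed.
SAMPLE_ROWS = [
    {
        "conversations": json.dumps([
            {"role": "system", "content": "system prompt with tool definitions"},
            {"role": "user", "content": "user query"},
            {"role": "assistant", "tool_calls": [
                {"type": "function",
                 "function": {"name": "create_reminder", "arguments": "{}"}}
            ]},
            {"role": "tool", "name": "create_reminder", "content": "tool results"},
            {"role": "assistant", "content": "final response"},
        ]),
        "tool_name": "create_reminder",
    },
    {
        "conversations": json.dumps([
            {"role": "system", "content": "system prompt"},
            {"role": "user", "content": "general question"},
            {"role": "assistant", "content": "direct answer"},
        ]),
        "tool_name": "negatives",  # ASSUMPTION: label used for no-tool rows
    },
]


def parse_messages(row):
    """Return the message list, decoding JSON if the field is a string."""
    raw = row["conversations"]
    return json.loads(raw) if isinstance(raw, str) else list(raw)


def uses_tool(messages):
    """True if any message is a tool result or carries tool_calls."""
    return any(m.get("role") == "tool" or m.get("tool_calls") for m in messages)


def tool_distribution(rows):
    """Map each tool_name to (count, percentage rounded to 2 decimals)."""
    counts = Counter(row["tool_name"] for row in rows)
    total = sum(counts.values())
    return {name: (n, round(100 * n / total, 2)) for name, n in counts.items()}


if __name__ == "__main__":
    for row in SAMPLE_ROWS:
        print(row["tool_name"], uses_tool(parse_messages(row)))
    print(tool_distribution(SAMPLE_ROWS))
```

To try this on the real data, the rows could be obtained with `datasets.load_dataset("ThaiLLM/med-app-instruct", split="qwen3_5_plus")` and passed to the same helpers; if the stored encoding differs from the assumption above, `parse_messages` is the only function that would need to change.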
Provided by:
ThaiLLM