MBZUAI/BiMed-V-1.6M

Name: MBZUAI/BiMed-V-1.6M
Creator: MBZUAI
Published: 2025-10-21 13:44:10
License: 暂无描述

Hugging Face2025-10-21 更新2026-01-03 收录

下载链接：

https://hf-mirror.com/datasets/MBZUAI/BiMed-V-1.6M

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: cc-by-nc-sa-4.0 task_categories: - visual-question-answering - image-to-text language: - en - ar multilinguality: - multilingual size_categories: - 1M<n<10M pretty_name: BiMed-V-1.6M tags: - medical - biomedical - multimodal - vision-language - instruction-tuning - arabic - bilingual configs: - config_name: stage1 data_files: - split: train path: BiMed-V_stage1.json - config_name: stage2 data_files: - split: train path: BiMed-V_stage2.json dataset_info: features: - name: id dtype: string - name: image dtype: string - name: conversations list: - name: from dtype: string - name: value dtype: string - name: language dtype: string - name: modality dtype: string splits: - name: stage1 num_examples: 1691407 - name: stage2 num_examples: 467147 --- # BiMed-V-1.6M Dataset [![Website](https://img.shields.io/badge/Project-Website-87CEEB)](https://github.com/mbzuai-oryx/BiMediX2) [![Paper](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2412.07769) [![HuggingFace](https://img.shields.io/badge/HuggingFace-Model-F9D371)](https://huggingface.co/collections/MBZUAI/bimedix2-675ee7528464dfd03f746127) [![License](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey)](https://github.com/mbzuai-oryx/BiMediX/blob/main/LICENSE.txt) ## Dataset Description **BiMed-V** is a comprehensive **Arabic-English multimodal bilingual instruction dataset** comprising over **1.6M instructions** designed for training medical Large Multimodal Models (LMMs). This dataset was curated as part of the BiMediX2 project, the first bilingual (Arabic-English) Bio-Medical Expert LMM for diverse medical modalities. The dataset supports both English and Arabic languages, enabling the development of multilingual medical AI systems. ## Dataset Structure The dataset is split into two training stages: ### Files - **BiMed-V_stage1.json**: Pretraining data for vision-language projection alignment - **BiMed-V_stage2.json**: Instruction finetuning data for task-specific adaptation ### Data Fields Each sample in the dataset contains: - **id**: Unique identifier for the sample - **image**: Path or reference to the associated medical image (None for text-only instruction samples) - **conversations**: List of conversation turns with roles - **language**: Arabic or English - **modality**: Text or Vision ## Citation If you use BiMed-V dataset in your research, please cite: ```bibtex @misc{mullappilly2024bimedix2biomedicalexpertlmm, title={BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities}, author={Sahal Shaji Mullappilly and Mohammed Irfan Kurpath and Sara Pieri and Saeed Yahya Alseiari and Shanavas Cholakkal and Khaled Aldahmani and Fahad Khan and Rao Anwer and Salman Khan and Timothy Baldwin and Hisham Cholakkal}, year={2024}, eprint={2412.07769}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2412.07769}, } ``` ## License BiMed-V dataset is released under the **CC-BY-NC-SA 4.0 License**. For more details, please refer to the [LICENSE](https://github.com/mbzuai-oryx/BiMediX/blob/main/LICENSE.txt) file. ## Ethical Considerations ⚠️ **Important Notice**: This dataset is intended for **research purposes only** and is **not ready for clinical or commercial use**. Users should be aware that: - Medical AI models trained on this data should be validated by qualified healthcare professionals - The dataset should not be used as the sole basis for medical diagnoses or treatment decisions - Outputs from models trained on this data may contain errors, biases, or hallucinations - Always verify AI-generated medical information with licensed healthcare providers

提供机构：

MBZUAI

5,000+

优质数据集

54 个

任务类型

进入经典数据集