five

MBZUAI/BiMed-V-1.6M

收藏
Hugging Face2025-10-21 更新2026-01-03 收录
下载链接:
https://hf-mirror.com/datasets/MBZUAI/BiMed-V-1.6M
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: cc-by-nc-sa-4.0 task_categories: - visual-question-answering - image-to-text language: - en - ar multilinguality: - multilingual size_categories: - 1M<n<10M pretty_name: BiMed-V-1.6M tags: - medical - biomedical - multimodal - vision-language - instruction-tuning - arabic - bilingual configs: - config_name: stage1 data_files: - split: train path: BiMed-V_stage1.json - config_name: stage2 data_files: - split: train path: BiMed-V_stage2.json dataset_info: features: - name: id dtype: string - name: image dtype: string - name: conversations list: - name: from dtype: string - name: value dtype: string - name: language dtype: string - name: modality dtype: string splits: - name: stage1 num_examples: 1691407 - name: stage2 num_examples: 467147 --- # BiMed-V-1.6M Dataset [![Website](https://img.shields.io/badge/Project-Website-87CEEB)](https://github.com/mbzuai-oryx/BiMediX2) [![Paper](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2412.07769) [![HuggingFace](https://img.shields.io/badge/HuggingFace-Model-F9D371)](https://huggingface.co/collections/MBZUAI/bimedix2-675ee7528464dfd03f746127) [![License](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey)](https://github.com/mbzuai-oryx/BiMediX/blob/main/LICENSE.txt) ## Dataset Description **BiMed-V** is a comprehensive **Arabic-English multimodal bilingual instruction dataset** comprising over **1.6M instructions** designed for training medical Large Multimodal Models (LMMs). This dataset was curated as part of the BiMediX2 project, the first bilingual (Arabic-English) Bio-Medical Expert LMM for diverse medical modalities. The dataset supports both English and Arabic languages, enabling the development of multilingual medical AI systems. ## Dataset Structure The dataset is split into two training stages: ### Files - **BiMed-V_stage1.json**: Pretraining data for vision-language projection alignment - **BiMed-V_stage2.json**: Instruction finetuning data for task-specific adaptation ### Data Fields Each sample in the dataset contains: - **id**: Unique identifier for the sample - **image**: Path or reference to the associated medical image (None for text-only instruction samples) - **conversations**: List of conversation turns with roles - **language**: Arabic or English - **modality**: Text or Vision ## Citation If you use BiMed-V dataset in your research, please cite: ```bibtex @misc{mullappilly2024bimedix2biomedicalexpertlmm, title={BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities}, author={Sahal Shaji Mullappilly and Mohammed Irfan Kurpath and Sara Pieri and Saeed Yahya Alseiari and Shanavas Cholakkal and Khaled Aldahmani and Fahad Khan and Rao Anwer and Salman Khan and Timothy Baldwin and Hisham Cholakkal}, year={2024}, eprint={2412.07769}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2412.07769}, } ``` ## License BiMed-V dataset is released under the **CC-BY-NC-SA 4.0 License**. For more details, please refer to the [LICENSE](https://github.com/mbzuai-oryx/BiMediX/blob/main/LICENSE.txt) file. ## Ethical Considerations ⚠️ **Important Notice**: This dataset is intended for **research purposes only** and is **not ready for clinical or commercial use**. Users should be aware that: - Medical AI models trained on this data should be validated by qualified healthcare professionals - The dataset should not be used as the sole basis for medical diagnoses or treatment decisions - Outputs from models trained on this data may contain errors, biases, or hallucinations - Always verify AI-generated medical information with licensed healthcare providers
提供机构:
MBZUAI
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作