MIL-UT/Japanese-Medical-VQA-12m

Name: MIL-UT/Japanese-Medical-VQA-12m
Creator: MIL-UT
Published: 2026-03-14 09:06:06
License: cc-by-sa-4.0

Hugging Face · Updated 2026-03-14 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/MIL-UT/Japanese-Medical-VQA-12m

Description:

---
language:
- ja
- en
license: cc-by-sa-4.0
pretty_name: Japanese Medical VQA 12M
tags:
- medical
- image-text
- multimodal
- japanese
- vision-language
- captioning
- visual-question-answering
- reasoning
viewer: true
configs:
- config_name: data
  data_files:
  - split: train
    path: parquet/train-*.parquet
size_categories:
- 10M<n<100M
task_categories:
- image-to-text
- question-answering
---

# Japanese Medical VQA 12M

Japanese Medical VQA 12M is a large-scale Japanese medical multimodal dataset built from [Open-PMC-18M](https://huggingface.co/datasets/vector-institute/open-pmc-18m) and released in Parquet and WebDataset formats.

This dataset contains outputs from multiple data-construction stages, including:

- source captions
- Japanese translations of source captions
- enriched captions
- Japanese translations of enriched captions
- question-answer pairs

## Current Repository Format

This repository currently stores the dataset in both Parquet and WebDataset formats.

```text
.
├── README.md
├── parquet/
│   ├── train-00000-of-XXXXX.parquet
│   └── ...
└── webdataset/
    ├── dataset_part_XXXXX.tar
    └── ...
```

## Data Schema

Each row is expected to contain the following columns:

* `id`: sample identifier
* `image`: image column
* `original_caption`: original caption in the source language
* `original_caption_ja`: Japanese translation of the original caption
* `enriched_caption`: recaptioned / enriched caption in the source language
* `enriched_caption_ja`: Japanese translation of the enriched caption
* `question`: generated question or instruction
* `answer`: generated target answer

## Data Construction Overview

1. Start from **[Open-PMC-18M](https://huggingface.co/datasets/vector-institute/open-pmc-18m)**
2. Remove data that is not commercially usable
3. Generate enriched captions
4. Generate VQA-style supervision
5. Remove generation failures

### Models Used in Data Construction

| Step | Input | Output | Model / method |
| --- | --- | --- | --- |
| Caption enrichment | image + source caption | enriched caption | [InternVL3.5 38B](https://huggingface.co/OpenGVLab/InternVL3_5-38B) |
| Caption translation | source or enriched captions | Japanese captions | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) |
| VQA generation | source and enriched captions | question-answer pair | [GPT-oss 120B](https://huggingface.co/openai/gpt-oss-120b) |

### Sample Count Transitions

This dataset includes only commercially usable samples and contains 12,125,556 samples.

| Step | Count before | Count after | Dropped at this step | Notes |
| --- | ---: | ---: | ---: | --- |
| Raw Open-PMC-18M | 17,867,999 | 17,867,999 | 0 | Initial source size |
| Commercial-use filtering | 17,867,999 | 12,125,556 | 5,742,443 | Non-commercial samples removed |

### Missing Values

During multi-stage automatic data construction, generation failures were replaced with empty strings (`""`) in one or more of the following columns:

- `original_caption_ja`
- `enriched_caption`
- `enriched_caption_ja`
- `question`
- `answer`

| Field | Missing / empty samples |
|---|---:|
| `original_caption_ja` | 2,674 |
| `enriched_caption` | 0 |
| `enriched_caption_ja` | 2,909 |
| `question` | 1,668 |
| `answer` | 1,668 |

## Loading the Dataset

### Parquet

```python
from huggingface_hub import snapshot_download

# Download only the Parquet shards (the parquet/ directory of the repository)
repo_dir = snapshot_download(
    repo_id="MIL-UT/Japanese-Medical-VQA-12m",
    repo_type="dataset",  # valid repo types are "model", "dataset", and "space"
    allow_patterns=["parquet/*"],
    local_dir="Japanese-Medical-VQA-12m-parquet",
)
print("Saved at:", repo_dir)
```

### WebDataset

```python
from huggingface_hub import snapshot_download

# Download only the WebDataset shards (the webdataset/ directory of the repository)
repo_dir = snapshot_download(
    repo_id="MIL-UT/Japanese-Medical-VQA-12m",
    repo_type="dataset",  # valid repo types are "model", "dataset", and "space"
    allow_patterns=["webdataset/*"],
    local_dir="Japanese-Medical-VQA-12m-webdataset",
)
print("Saved at:", repo_dir)
```

## Note

For more details on the dataset construction process, preprocessing pipeline, and generation procedure, please refer to our [paper](https://www.anlp.jp/proceedings/annual_meeting/2026/pdf_dir/C4-15.pdf).

## Maintenance

Please contact: ando \[at\] mi.t.u-tokyo.ac.jp
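
## Example: Skipping Rows with Empty Generated Fields

The dataset card states that generation failures are stored as empty strings (`""`) in some generated columns. The sketch below shows one way a consumer might detect such rows before training. It is not part of the dataset's own tooling: the helper name `has_complete_text`, the choice to treat whitespace-only or missing values like empty strings, and the sample rows are all assumptions made for illustration.

```python
from typing import Any, Mapping, Sequence

# Columns the dataset card lists as possibly empty after a generation failure.
GENERATED_COLUMNS = (
    "original_caption_ja",
    "enriched_caption",
    "enriched_caption_ja",
    "question",
    "answer",
)


def has_complete_text(
    row: Mapping[str, Any],
    columns: Sequence[str] = GENERATED_COLUMNS,
) -> bool:
    """Return True if every listed column holds a non-empty string."""
    for name in columns:
        value = row.get(name)
        # Assumption: a missing key, None, or whitespace-only text is
        # treated the same way as the documented empty string "".
        if not isinstance(value, str) or not value.strip():
            return False
    return True


# Hypothetical rows for illustration only; they are not taken from the dataset.
base = {
    "original_caption_ja": "胸部X線画像。",
    "enriched_caption": "Frontal chest radiograph.",
    "enriched_caption_ja": "胸部正面X線写真。",
    "question": "この画像は何を示していますか？",
    "answer": "胸部X線写真です。",
}
rows = [
    {"id": "sample-0", **base},
    {"id": "sample-1", **base, "question": "", "answer": ""},
    {"id": "sample-2", **base, "enriched_caption_ja": "   "},
]

kept_ids = [row["id"] for row in rows if has_complete_text(row)]
print(kept_ids)
```

Because the helper takes a single row as a mapping, it should also be usable as the predicate for the `filter` method of the Hugging Face `datasets` library. Passing a narrower `columns` tuple restricts the check to the fields a given task actually needs, for example only `question` and `answer` for VQA training.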
