MIL-UT/Japanese-Medical-VQA-12m
Source: Hugging Face | Updated: 2026-03-14 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/MIL-UT/Japanese-Medical-VQA-12m
Resource description:
---
language:
- ja
- en
license: cc-by-sa-4.0
pretty_name: Japanese Medical VQA 12M
tags:
- medical
- image-text
- multimodal
- japanese
- vision-language
- captioning
- visual-question-answering
- reasoning
viewer: true
configs:
- config_name: data
  data_files:
  - split: train
    path: parquet/train-*.parquet
size_categories:
- 10M<n<100M
task_categories:
- image-to-text
- question-answering
---
# Japanese Medical VQA 12M
Japanese Medical VQA 12M is a large-scale Japanese medical multimodal dataset built from [Open-PMC-18M](https://huggingface.co/datasets/vector-institute/open-pmc-18m) and released in both Parquet and Webdataset formats.
This dataset contains outputs from multiple data-construction stages, including:
- source captions
- Japanese translations of source captions
- enriched captions
- Japanese translations of enriched captions
- generated question-answer pairs
## Current Repository Format
This repository currently stores the dataset in both Parquet and Webdataset formats, laid out as follows:
```text
.
├── README.md
├── parquet/
│   ├── train-00000-of-XXXXX.parquet
│   └── ...
└── webdataset/
    ├── dataset_part_XXXXX.tar
    └── ...
```
## Data Schema
Each row is expected to contain the following columns:
* `id`: sample identifier
* `image`: image column
* `original_caption`: original caption in the source language
* `original_caption_ja`: Japanese translation of the original caption
* `enriched_caption`: recaptioned / enriched caption in the source language
* `enriched_caption_ja`: Japanese translation of the enriched caption
* `question`: generated question or instruction
* `answer`: generated target answer
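
As a quick sanity check on the schema above, a minimal sketch like the following could confirm that a loaded row carries every listed column. The helper `missing_columns` and the placeholder row are hypothetical and only illustrate the idea; the column names are the ones listed in this section.

```python
from typing import Any, Mapping

# Column names copied from the schema list in this README.
EXPECTED_COLUMNS = (
    "id",
    "image",
    "original_caption",
    "original_caption_ja",
    "enriched_caption",
    "enriched_caption_ja",
    "question",
    "answer",
)


def missing_columns(row: Mapping[str, Any]) -> list[str]:
    """Return the expected column names that are absent from `row`."""
    return [name for name in EXPECTED_COLUMNS if name not in row]


# Placeholder row for illustration only (not real dataset content).
example_row = {name: "" for name in EXPECTED_COLUMNS}
print(missing_columns(example_row))  # []
```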
## Data Construction Overview
1. Start from **[Open-PMC-18M](https://huggingface.co/datasets/vector-institute/open-pmc-18m)**
2. Remove non-commercially usable data
3. Generate enriched captions
4. Generate VQA-style supervision
5. Remove generation failures
### Models Used in Data Construction
| Step | Input | Output | Model / method |
| --------------------------------- | -------------------------- | --------------------- | --------------- |
| Caption enrichment | image + source caption | enriched caption | [InternVL3.5 38B](https://huggingface.co/OpenGVLab/InternVL3_5-38B) |
| Caption translation | source or enriched captions | Japanese captions | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) |
| VQA generation | source and enriched captions | question-answer pair | [GPT-oss 120B](https://huggingface.co/openai/gpt-oss-120b) |
### Sample Count Transitions
This dataset includes only commercially usable samples and contains 12,125,556 samples.
| Step | Count before | Count after | Dropped at this step | Notes |
| ---------------------------------------- | -----------: | ----------: | -------------------: | ------------------------------------------ |
| Raw Open-PMC-18M | 17,867,999 | 17,867,999 | 0 | Initial source size |
| Commercial-use filtering | 17,867,999 | 12,125,556 | 5,742,443 | Non-commercial samples removed |
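
The figures in the table can be cross-checked with a few lines of arithmetic. The snippet below only restates the counts given above; the variable names are illustrative.

```python
# Counts taken from the table above.
raw_count = 17_867_999   # Raw Open-PMC-18M
kept_count = 12_125_556  # After commercial-use filtering

dropped = raw_count - kept_count
retention = kept_count / raw_count

print(f"dropped:  {dropped:,}")      # dropped:  5,742,443
print(f"retained: {retention:.1%}")  # retained: 67.9%
```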
### Missing Values
During multi-stage automatic data construction, generation failures were replaced with empty strings (`""`) in one or more of the following columns:
- `original_caption_ja`
- `enriched_caption`
- `enriched_caption_ja`
- `question`
- `answer`
| Field | Missing / empty samples |
|---|---:|
| `original_caption_ja`| 2,674 |
| `enriched_caption` | 0 |
| `enriched_caption_ja`| 2,909 |
| `question` | 1,668 |
| `answer` | 1,668 |
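
Because failures are stored as empty strings rather than dropped, downstream code may want to skip incomplete rows. The following is a minimal sketch, assuming rows are available as plain dictionaries keyed by column name; `is_complete`, `drop_incomplete`, and the sample rows are hypothetical helpers and placeholders, not part of the dataset tooling.

```python
from typing import Any, Iterable, Iterator, Mapping, Sequence

# Columns that may hold an empty string after a generation failure
# (as listed in this section).
GENERATED_TEXT_COLUMNS = (
    "original_caption_ja",
    "enriched_caption",
    "enriched_caption_ja",
    "question",
    "answer",
)


def is_complete(
    row: Mapping[str, Any],
    columns: Sequence[str] = GENERATED_TEXT_COLUMNS,
) -> bool:
    """True if every listed column is present and non-empty.

    A missing key, None, and "" all count as missing; whitespace-only
    strings are treated as present.
    """
    return all(row.get(name) for name in columns)


def drop_incomplete(
    rows: Iterable[Mapping[str, Any]],
    columns: Sequence[str] = GENERATED_TEXT_COLUMNS,
) -> Iterator[Mapping[str, Any]]:
    """Lazily yield only the rows that pass `is_complete`."""
    return (row for row in rows if is_complete(row, columns))


# Placeholder rows for illustration only (not real dataset content).
rows = [
    {"id": "a", **{name: "text" for name in GENERATED_TEXT_COLUMNS}},
    {"id": "b", **{name: "text" for name in GENERATED_TEXT_COLUMNS}, "answer": ""},
]
kept_ids = [row["id"] for row in drop_incomplete(rows)]
print(kept_ids)  # ['a']
```

Passing a narrower `columns` tuple, for example `("question", "answer")`, keeps rows whose captions failed to translate but whose question-answer pair is intact.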
## Loading the Dataset
### Parquet
```python
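from huggingface_hub import snapshot_download

# Download only the Parquet shards.
# repo_type must be "dataset" ("model", "dataset", and "space" are the valid
# values); the format is selected by filtering files with allow_patterns.
repo_dir = snapshot_download(
    repo_id="MIL-UT/Japanese-Medical-VQA-12m",
    repo_type="dataset",
    allow_patterns=["parquet/*"],
    local_dir="Japanese-Medical-VQA-12m-parquet",
)
print("Saved at:", repo_dir)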
```
### Webdataset
```python
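from huggingface_hub import snapshot_download

# Download only the Webdataset tar shards.
# repo_type must be "dataset" ("model", "dataset", and "space" are the valid
# values); the format is selected by filtering files with allow_patterns.
repo_dir = snapshot_download(
    repo_id="MIL-UT/Japanese-Medical-VQA-12m",
    repo_type="dataset",
    allow_patterns=["webdataset/*"],
    local_dir="Japanese-Medical-VQA-12m-webdataset",
)
print("Saved at:", repo_dir)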
```
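
The internal layout of the `.tar` shards is not documented here. Assuming they follow the common WebDataset convention, in which consecutive files sharing the same name up to the first dot of the basename form one sample, a downloaded shard could be read with Python's standard `tarfile` module roughly as sketched below. `split_key`, `iter_samples`, `make_demo_shard`, and the file names in the demo shard are all hypothetical; the actual extensions inside the shards may differ.

```python
import io
import tarfile
from typing import BinaryIO, Iterator


def split_key(member_name: str) -> tuple[str, str]:
    """Split 'dir/0001.cap.txt' into key 'dir/0001' and extension 'cap.txt'."""
    directory, _, base = member_name.rpartition("/")
    stem, _, ext = base.partition(".")
    key = f"{directory}/{stem}" if directory else stem
    return key, ext


def iter_samples(fileobj: BinaryIO) -> Iterator[tuple[str, dict[str, bytes]]]:
    """Yield (key, {extension: raw bytes}) for each group of consecutive members."""
    current_key = None
    files: dict[str, bytes] = {}
    # "r|*" reads the archive as a forward-only stream.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, ext = split_key(member.name)
            if key != current_key and files:
                yield current_key, files
                files = {}
            current_key = key
            files[ext] = tar.extractfile(member).read()
    if files:
        yield current_key, files


def make_demo_shard() -> io.BytesIO:
    """Build a tiny in-memory shard with made-up file names and payloads."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [
            ("000001.jpg", b"img"),
            ("000001.json", b"{}"),
            ("000002.jpg", b"img2"),
        ]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


samples = list(iter_samples(make_demo_shard()))
for key, files in samples:
    print(key, sorted(files))
```

For a real shard, the same generator could be fed an opened file, for example `open(path, "rb")` on one of the downloaded `dataset_part_XXXXX.tar` files.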
## Note
For more details on the dataset construction process, preprocessing pipeline, and generation procedure, please refer to our [paper](https://www.anlp.jp/proceedings/annual_meeting/2026/pdf_dir/C4-15.pdf).
## Maintenance
Please contact: ando \[at\] mi.t.u-tokyo.ac.jp
Provider:
MIL-UT



