
MIL-UT/DEJIMA-dataset

Source: Hugging Face · Updated 2025-12-02 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/MIL-UT/DEJIMA-dataset
Resource description:
---
pretty_name: DEJIMA Dataset
dataset_summary: DEJIMA is a large-scale Japanese multimodal (image + text) dataset built from web-scale images and text via a scalable, detection-driven, LLM-based pipeline. It consists of 3.88M image–caption pairs (DEJIMA-Cap) and 3.88M image–VQA pairs (DEJIMA-VQA), all in Japanese, with multiple variants that isolate the effect of alt-text refinement and detection-based grounding.
language:
- ja
multilinguality:
- monolingual
license: apache-2.0
size_categories:
- 1M<n<10M
task_categories:
- image-to-text
- visual-question-answering
task_ids:
- image-captioning
- visual-question-answering
configs:
# Captioning
- config_name: cap-simple
  data_files:
  - split: train
    path: "data/dejima-cap-simple.jsonl"
- config_name: cap-refined
  data_files:
  - split: train
    path: "data/dejima-cap-refined.jsonl"
- config_name: cap-detection
  data_files:
  - split: train
    path: "data/dejima-cap-detection.jsonl"
- config_name: cap-all
  data_files:
  - split: train
    path: "data/dejima-cap-all.jsonl"
  default: true # selected by load_dataset("MIL-UT/DEJIMA-dataset")
# VQA
- config_name: vqa-refined
  data_files:
  - split: train
    path: "data/dejima-vqa-refined.jsonl"
- config_name: vqa-detection
  data_files:
  - split: train
    path: "data/dejima-vqa-detection.jsonl"
- config_name: vqa-all
  data_files:
  - split: train
    path: "data/dejima-vqa-all.jsonl"
---

# DEJIMA Dataset

## Overview

**DEJIMA** is a large-scale Japanese multimodal (image + text) dataset constructed through a scalable and fully reproducible pipeline combining:

- Web-scale image collection
- Strict filtering and deduplication
- Detection-driven evidence extraction
- LLM-based caption/VQA generation under grounding constraints

DEJIMA contains:

- **3.88M image–caption pairs (DEJIMA-Cap)**
- **3.88M image–question–answer pairs (DEJIMA-VQA)**

All annotations are in **Japanese**.
Each example is composed of:

### Captioning (`cap-*`)

- `id`: unique integer ID for the image–caption example
- `url`: HTTP(S) URL to the original web image (image pixels **not redistributed**)
- `caption`: Japanese caption sentence(s), generated/refined by an LLM

### VQA (`vqa-*`)

- `id`: unique integer ID for the image–QA example
- `url`: HTTP(S) URL to the original web image
- `question`: Japanese open-ended question about the image
- `answer`: Japanese free-form answer

Related resources can be found below:

- **Project page**: [mil-tokyo/DEJIMA-dataset](https://mil-tokyo.github.io/DEJIMA-dataset)
- **Dataset construction code**: [mil-tokyo/DEJIMA-construct](https://github.com/mil-tokyo/DEJIMA-construct)
- **Training / inference code**: [mil-tokyo/DEJIMA-VLM](https://github.com/mil-tokyo/DEJIMA-VLM)
- **Dataset (Hugging Face)**: [MIL-UT/DEJIMA-dataset](https://huggingface.co/datasets/MIL-UT/DEJIMA-dataset)

---

## Dataset Variants

To isolate the contribution of each pipeline component, DEJIMA provides several variants for both captioning and VQA.

### Captioning

- **DEJIMA-Cap-Simple**: Filtered raw image–alt-text pairs.
- **DEJIMA-Cap-Refined**: LLM-refined captions starting from alt-text.
- **DEJIMA-Cap-Detection**: Captions generated using only detection tags.
- **DEJIMA-Cap-All**: Captions generated using both alt-text and detection tags as inputs.

### VQA

- **DEJIMA-VQA-Refined**: Generated from alt-text using an LLM.
- **DEJIMA-VQA-Detection**: Generated from detection tags only.
- **DEJIMA-VQA-All**: Generated from both alt-text and detection-based evidence.

---

## Files

### Caption subsets

- `dejima-cap-simple.jsonl`
- `dejima-cap-refined.jsonl`
- `dejima-cap-detection.jsonl`
- `dejima-cap-all.jsonl`

### VQA subsets

- `dejima-vqa-refined.jsonl`
- `dejima-vqa-detection.jsonl`
- `dejima-vqa-all.jsonl`

Each file is a JSONL list of machine-generated annotations with the fields described above.
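The field lists above can be checked mechanically when reading the JSONL files directly. The following is a minimal sketch, not part of the DEJIMA code: the helper names `expected_fields` and `parse_record` are hypothetical, and the two sample lines are made up for illustration rather than taken from the dataset.

```python
import json

# Field names per config family, copied from the field lists in this card.
EXPECTED_FIELDS = {
    "cap": ("id", "url", "caption"),
    "vqa": ("id", "url", "question", "answer"),
}


def expected_fields(config_name):
    """Return the expected field names for a config such as "cap-all"."""
    family = config_name.split("-", 1)[0]
    if family not in EXPECTED_FIELDS:
        raise ValueError(f"unknown config family: {config_name!r}")
    return EXPECTED_FIELDS[family]


def parse_record(line, config_name):
    """Parse one JSONL line and check that it carries the expected fields."""
    record = json.loads(line)
    missing = [f for f in expected_fields(config_name) if f not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    if not record["url"].startswith(("http://", "https://")):
        raise ValueError(f"url is not HTTP(S): {record['url']!r}")
    return record


# Made-up sample lines for illustration only; not real dataset entries.
sample_cap = '{"id": 1, "url": "https://example.com/a.jpg", "caption": "公園のベンチに座る猫。"}'
sample_vqa = '{"id": 2, "url": "https://example.com/b.jpg", "question": "何が写っていますか?", "answer": "猫です。"}'

cap = parse_record(sample_cap, "cap-all")
vqa = parse_record(sample_vqa, "vqa-all")
print(cap["caption"])
print(vqa["question"], vqa["answer"])
```

To process a whole file, the same function could be applied to each line of, for example, `dejima-cap-all.jsonl` opened with `encoding="utf-8"`.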
---

## Usage

Load any variant using the `name` corresponding to its task and variant:

```python
from datasets import load_dataset

ds = load_dataset("MIL-UT/DEJIMA-dataset", "cap-all", split="train")
print(ds[0])
```

Available builder configs:

* `cap-simple`
* `cap-refined`
* `cap-detection`
* `cap-all`
* `vqa-refined`
* `vqa-detection`
* `vqa-all`

---

## Statistics

| Dataset              | Type                  |  # Images |   # Texts | Avg. # Chars | Vocabulary Size |
| -------------------- | --------------------- | --------: | --------: | -----------: | --------------: |
| DEJIMA-Cap-Simple    | Alt                   | 3,884,632 | 3,884,632 |        18.21 |         336,924 |
| DEJIMA-Cap-Refined   | Alt + LLM             | 3,884,629 | 3,884,629 |        38.03 |         314,900 |
| DEJIMA-Cap-Detection | Detection + LLM       | 3,884,632 | 3,884,632 |        49.55 |          30,674 |
| DEJIMA-Cap-All       | Alt + Detection + LLM | 3,884,632 | 3,884,632 |        79.62 |         287,434 |
| DEJIMA-VQA-Refined   | Alt + LLM             | 3,875,343 | 3,875,343 |        56.62 |         321,720 |
| DEJIMA-VQA-Detection | Detection + LLM       | 3,883,943 | 3,883,943 |        77.00 |          31,929 |
| DEJIMA-VQA-All       | Alt + Detection + LLM | 3,882,892 | 3,882,892 |       108.86 |         278,860 |

---

## License

This dataset is released under the **Apache License 2.0**.

* The **annotations** (`id`, `caption`, `question`, `answer`) and the **dataset structure** (JSONL files, indexing, metadata) are licensed under **Apache 2.0**.
* The **images referenced via `url` are *not* included in this license**. Each image retains the copyright and license of its original source.

We redistribute **only URLs**, not the image files themselves. When accessing the images, please follow the respective website’s terms of use and copyright conditions.
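Because only URLs are redistributed, the image pixels have to be retrieved by the user. The sketch below shows one possible way to do this with the Python standard library. It is an assumption-laden illustration, not the project's own downloader: the function names `local_image_path` and `fetch_image`, the cache directory name, and the `User-Agent` string are all hypothetical. Please check each site's terms of use before downloading.

```python
import hashlib
import os
import urllib.request


def local_image_path(record, cache_dir):
    """Derive a stable local file path for a record from its id and URL."""
    digest = hashlib.sha256(record["url"].encode("utf-8")).hexdigest()[:16]
    return os.path.join(cache_dir, f"{record['id']}_{digest}")


def fetch_image(record, cache_dir="dejima_images", timeout=10.0):
    """Download the image behind record["url"] once.

    Returns the local path, or None if the URL is malformed or unreachable.
    """
    path = local_image_path(record, cache_dir)
    if os.path.exists(path):
        return path  # already cached from an earlier call
    os.makedirs(cache_dir, exist_ok=True)
    try:
        # Hypothetical User-Agent string; replace with one identifying your client.
        request = urllib.request.Request(
            record["url"], headers={"User-Agent": "dejima-example/0.1"}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            data = response.read()
    except (OSError, ValueError):
        # Web images may have moved or disappeared; skip them rather than fail.
        return None
    with open(path, "wb") as f:
        f.write(data)
    return path
```

A record loaded from any config can be passed directly, for example `fetch_image(ds[0])`. Returning `None` for unreachable images lets a training pipeline drop dead links instead of stopping.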
---

## Project & Models

* Project page: [https://mil-tokyo.github.io/DEJIMA-dataset](https://mil-tokyo.github.io/DEJIMA-dataset)
* Code: [https://github.com/mil-tokyo/DEJIMA-construct](https://github.com/mil-tokyo/DEJIMA-construct)
* Dataset: [https://huggingface.co/datasets/MIL-UT/DEJIMA-dataset](https://huggingface.co/datasets/MIL-UT/DEJIMA-dataset)
* Models: [https://huggingface.co/MIL-UT/DEJIMA-models](https://huggingface.co/MIL-UT/DEJIMA-models)

---

## Citation

If you use DEJIMA in your research, please cite our paper (to appear).

```bibtex
@misc{katsube2025dejimanovellargescalejapanese,
  title={DEJIMA: A Novel Large-scale Japanese Dataset for Image Captioning and Visual Question Answering},
  author={Toshiki Katsube and Taiga Fukuhara and Kenichiro Ando and Yusuke Mukuta and Kohei Uehara and Tatsuya Harada},
  year={2025},
  eprint={2512.00773},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.00773},
}
```
Provider:
MIL-UT