ziyul707/InsightVQA

Name: ziyul707/InsightVQA
Creator: ziyul707
Published: 2026-04-06 14:06:50
License: no description provided

Hugging Face, updated 2026-04-06, indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/ziyul707/InsightVQA

Description:

---
language:
- en
license: cc-by-4.0
task_categories:
- visual-question-answering
- image-classification
tags:
- emotion-recognition
- multimodal
- multi-level-reasoning
size_categories:
- 100K<n<1M
---

# InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark

[![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://akanthawang.github.io/InsightVQA) [![Dataset](https://img.shields.io/badge/Dataset-InsightVQA-green)](https://huggingface.co/datasets/ziyul707/InsightVQA) [![License](https://img.shields.io/badge/License-CC%20BY%204.0-orange)](LICENSE)

## Overview

**InsightVQA** is a large-scale dataset for hierarchical visual question answering that bridges emotion understanding and cognitive reasoning. While existing benchmarks predominantly focus on surface-level emotion recognition, InsightVQA introduces a structured paradigm to evaluate a model's ability to interpret emotional causes, ground evidence, and reason about underlying cognitive processes. Built from rigorously curated and validated images, the dataset challenges multimodal models to move beyond discrete label prediction toward interpretable, human-centered cognitive computing.

## Dataset Structure

### File Organization

The repository provides the images and annotations as compressed archives (`.tar.gz`) for efficient downloading, alongside the compiled JSONL QA pairs. After extraction, the logical directory structure is as follows:

```text
InsightVQA/
├── Images/                  # 138,008 images (extracted from Images.tar.gz)
│   ├── amusement/           # amusement_00001.jpg, ...
│   ├── anger/               # anger_00001.jpg, ...
│   ├── awe/
│   ├── contentment/
│   ├── disgust/
│   ├── excitement/
│   ├── fear/
│   └── sadness/
├── Annotation/              # Detailed JSON annotations (extracted from Annotation.tar.gz)
│   ├── Perception/          # Annotations for label and valence
│   │   ├── amusement/       # Detailed JSON files mapping to amusement images
│   │   └── ... (8 emotion folders)
│   ├── Understanding/       # Annotations for visual triggers and reasoning
│   │   ├── amusement/
│   │   └── ... (8 emotion folders)
│   └── Cognition/           # Annotations for response intent and insight sequences
│       ├── amusement/
│       └── ... (8 emotion folders)
├── train.jsonl              # 653,292 QA pairs
└── test.jsonl               # 30,841 QA pairs
```

**Understanding the `Annotation` Directory:** While `train.jsonl` and `test.jsonl` provide ready-to-use question-answer pairs for model training and evaluation, the `Annotation` directory contains the granular, raw JSON files for each individual image. These files are organized first by the three cognitive layers (**Perception**, **Understanding**, **Cognition**) and then by the 8 emotion categories. They expose the intermediate reasoning steps and metadata used to construct the final dataset.

### Annotation Format

Both `train.jsonl` and `test.jsonl` follow a unified format. The `answer` field wraps the ground truth in `<answer></answer>` tags.

**1. Perception Layer**

```json
{
  "image_path": "Images/amusement/amusement_019825.jpg",
  "type": "Perception",
  "question": "What kind of feeling does the image evoke? Please select the emotion closest to the image from the following options: amusement, anger, awe, contentment, disgust, excitement, fear and sadness. Please ensure the result is formatted as follows: <answer></answer>.",
  "answer": "<answer>amusement</answer>"
},
{
  "image_path": "Images/excitement/excitement_007484.jpg",
  "type": "Perception",
  "question": "Does the emotional quality of the image feel positive or negative? Please select the emotion closest to the image from the following options: positive, negative. Please ensure the result is formatted as follows: <answer></answer>.",
  "answer": "<answer>positive</answer>"
}
```

**2. Understanding Layer**

```json
{
  "image_path": "Images/amusement/amusement_016011.jpg",
  "type": "Understanding",
  "question": "How do the decorations' form and surface jointly build a whimsical look? Please ensure the result is formatted as follows: <answer></answer>.",
  "answer": "<answer>The orange star-shaped decorations and their smooth glossy surface combine for a bright, playful visual configuration.</answer>"
}
```

**3. Cognition Layer**

```json
{
  "image_path": "Images/awe/awe_016410.jpg",
  "type": "Cognition",
  "question": "What would your immediate response intent be if you encountered this scene? Please select the instinctive reaction that best matches the content of the image from the following options: acknowledge, comfort, encourage, celebrate, practical_help, investigate, deescalate and redirect. Please ensure the result is formatted as follows: <answer></answer>.",
  "answer": "<answer>acknowledge</answer>"
},
{
  "image_path": "Images/fear/fear_004696.jpg",
  "type": "Cognition",
  "question": "Describe the unfolding sequence of somatic, semantic, and regulatory responses to this scene. Please ensure the result is formatted as follows: <answer></answer>.",
  "answer": "<answer>The light through that open shutter is creating a harsh, uneven feeling. Reach toward the open shutter to pull it closed.</answer>"
}
```

## Dataset Statistics

InsightVQA is built upon a high-confidence perception foundation of **138,008** well-balanced images. The annotation pipeline yields a total of **725K** question-answer pairs, structured across the three cognitive layers.

### 1. Annotations by Cognitive Layer

The raw annotations are distributed across the three hierarchical stages to support multi-level reasoning:

| **Cognitive Layer** | **Total QA Pairs** | **Included Tasks** |
| ------------------- | ------------------ | ------------------------------------------------------------ |
| **Perception** | 276K | Emotion classification and valence recognition |
| **Understanding** | 330K | Visual attribution, contextual synthesis, and counterfactual reasoning |
| **Cognition** | 119K | Response intent and situational insight sequences (somatic, semantic, regulatory) |
| **Total** | **725K** | **The complete hierarchical annotation pool** |

### 2. Benchmark Splits

To support standardized training and evaluation, a curated subset of the annotations is formatted into the final benchmark splits (`train.jsonl` and `test.jsonl`):

| **Split** | **Images** | **QA Pairs** | **Purpose** |
| --------- | ---------- | ------------ | ------------------------------------------------------------ |
| **Train** | 124K | 653,292 | Generative QA format for foundational reasoning training |
| **Test** | 14K | 30,841 | Standardized discriminative tasks (MCQ/SJT) for fine-grained evaluation |

## Three-Tier Cognitive Architecture

InsightVQA formulates human-centered visual understanding through three progressively deeper stages:

### 1. Perception

The entry-level stage. It evaluates the model's ability to identify basic emotional states and valence from visual inputs.

### 2. Understanding

Explains *why* an emotion is perceived by grounding the reasoning process in verifiable visual evidence.

- **Visual Triggers:** Models must use appearance cues, scene cues, and agent cues.
- **Reasoning Types:** Includes visual attribution, contextual synthesis, and counterfactual reasoning.

### 3. Cognition

Focuses on higher-order cognitive reasoning and grounded response planning.

- **Response Intent:** Predicting the instinctive intent a viewer would have on encountering the scene.
- **Insight Sequences:** Evaluating the natural unfolding of somatic, semantic, and regulatory responses in a Situational Judgment Test format.

## Applications & Use Cases

- **Multimodal Large Language Models:** Benchmarking advanced reasoning capabilities, contextual dependencies, and cognitive-affective interactions.
- **Affective Computing:** Developing AI systems capable of deep emotion understanding for human-computer interaction and socially assistive robotics.
- **Visual Grounding:** Testing a model's ability to tie abstract emotional states to concrete, observable visual evidence.
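
As a minimal sketch of how the JSONL records described under Annotation Format might be consumed, the Python below parses records, pulls the ground truth out of the `<answer>` tags, and scores predictions by exact match per layer. The helper names (`extract_answer`, `parse_records`, `exact_match_by_layer`), the case-insensitive comparison, and the choice of exact match as a metric are assumptions made for illustration and are not part of the dataset release. Exact match is only meaningful for the closed-set questions (emotion, valence, response intent), not for the free-text answers.

```python
import json
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Matches the <answer>...</answer> wrapper used in the `answer` field.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(text):
    """Return the text inside the first <answer>...</answer> pair, or None."""
    match = ANSWER_RE.search(text)
    return match.group(1).strip() if match else None


def parse_records(lines):
    """Parse JSONL lines into dicts, adding the gold answer and emotion folder."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        rec["gold"] = extract_answer(rec["answer"])
        # Images/<emotion>/<file>.jpg -> <emotion>
        rec["emotion"] = PurePosixPath(rec["image_path"]).parent.name
        records.append(rec)
    return records


def exact_match_by_layer(records, predictions):
    """Case-insensitive exact-match accuracy per `type` (hypothetical metric)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for rec, pred in zip(records, predictions):
        totals[rec["type"]] += 1
        guess, gold = extract_answer(pred), rec["gold"]
        if guess is not None and gold is not None and guess.lower() == gold.lower():
            hits[rec["type"]] += 1
    return {layer: hits[layer] / totals[layer] for layer in totals}


# Two records taken from the examples in this card, one JSON object per line
# as in a JSONL file. The question text is shortened to "..." here.
sample_lines = [
    json.dumps({"image_path": "Images/amusement/amusement_019825.jpg",
                "type": "Perception", "question": "...",
                "answer": "<answer>amusement</answer>"}),
    json.dumps({"image_path": "Images/awe/awe_016410.jpg",
                "type": "Cognition", "question": "...",
                "answer": "<answer>acknowledge</answer>"}),
]

records = parse_records(sample_lines)
# Made-up model outputs: the first matches the gold label, the second does not.
predictions = ["<answer>Amusement</answer>", "I think <answer>comfort</answer>"]
scores = exact_match_by_layer(records, predictions)
print(scores)  # {'Perception': 1.0, 'Cognition': 0.0}
```

To run this against the real files, pass an open file handle for `train.jsonl` or `test.jsonl` to `parse_records` in place of `sample_lines`, since iterating a text file yields one line at a time.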
