ziyul707/InsightVQA
Source: Hugging Face · Updated 2026-04-06 · Indexed 2026-04-12
Download link: https://hf-mirror.com/datasets/ziyul707/InsightVQA
Resource description:
---
language:
- en
license: cc-by-4.0
task_categories:
- visual-question-answering
- image-classification
tags:
- emotion-recognition
- multimodal
- multi-level-reasoning
size_categories:
- 100K<n<1M
---
# InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark
[Project Page](https://akanthawang.github.io/InsightVQA)
[Hugging Face Dataset](https://huggingface.co/datasets/ziyul707/InsightVQA)
[License: CC BY 4.0](LICENSE)
## Overview
**InsightVQA** is a large-scale dataset designed for hierarchical visual question answering that bridges emotion understanding and cognitive reasoning. While existing benchmarks predominantly focus on surface-level emotion recognition, InsightVQA introduces a structured paradigm to evaluate a model's ability to interpret emotional causes, ground evidence, and reason about underlying cognitive processes.
Built from rigorously curated and validated images, this dataset challenges multimodal models to move beyond discrete label prediction toward interpretable, human-centered cognitive computing.
## Dataset Structure
### File Organization
The repository provides the data in compressed formats (`.tar.gz`) for efficient downloading, alongside the compiled JSONL QA pairs. Upon extraction, the logical directory structure is as follows:
```text
InsightVQA/
├── Images/ # 138,008 images (extracted from Images.tar.gz)
│ ├── amusement/ # amusement_00001.jpg, ...
│ ├── anger/ # anger_00001.jpg, ...
│ ├── awe/
│ ├── contentment/
│ ├── disgust/
│ ├── excitement/
│ ├── fear/
│ └── sadness/
├── Annotation/ # Detailed JSON annotations (extracted from Annotation.tar.gz)
│ ├── Perception/ # Annotations for label and valence
│ │ ├── amusement/ # Detailed JSON files mapping to amusement images
│ │ └── ... (8 emotion folders)
│ ├── Understanding/ # Annotations for visual triggers and reasoning
│ │ ├── amusement/
│ │ └── ... (8 emotion folders)
│ └── Cognition/ # Annotations for response intent and insight sequences
│ ├── amusement/
│ └── ... (8 emotion folders)
├── train.jsonl # 653,292 QA pairs
└── test.jsonl # 30,841 QA pairs
```
**Understanding the `Annotation` Directory:**
While `train.jsonl` and `test.jsonl` provide the ready-to-use Question-Answering pairs for model training and evaluation, the `Annotation` directory contains the granular, raw JSON files for each individual image. These files are hierarchically organized by the three cognitive layers (**Perception**, **Understanding**, **Cognition**) and then subdivided by the 8 emotion categories. They provide deeper insights into the intermediate reasoning steps and metadata used to construct the final dataset.
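As a minimal sketch of working with this layout, the Python snippet below unpacks one of the archives and counts the images in each emotion folder. The archive name `Images.tar.gz` and the eight folder names come from the tree above. The helper names `extract_archive` and `count_images_per_emotion`, the `.jpg` filter, and the assumption that the archive unpacks to a top-level `Images/` folder are illustrative, not part of any official tooling.

```python
import tarfile
from pathlib import Path

# The eight emotion folders listed in the directory tree.
EMOTIONS = ["amusement", "anger", "awe", "contentment",
            "disgust", "excitement", "fear", "sadness"]


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir (hypothetical helper)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "r:gz") as tar:
        # Use the safer "data" extraction filter where the interpreter supports it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(str(dest), filter="data")
        else:
            tar.extractall(str(dest))
    return dest


def count_images_per_emotion(images_dir):
    """Return {emotion: number of .jpg files} for an extracted Images/ folder."""
    root = Path(images_dir)
    counts = {}
    for emotion in EMOTIONS:
        folder = root / emotion
        # A missing folder counts as zero images rather than raising an error.
        counts[emotion] = sum(1 for _ in folder.glob("*.jpg")) if folder.is_dir() else 0
    return counts


if __name__ == "__main__":
    # Assumes Images.tar.gz was already downloaded into the current directory.
    if Path("Images.tar.gz").exists():
        extract_archive("Images.tar.gz", ".")
        print(count_images_per_emotion("Images"))
```

If the extraction is complete, the per-folder counts should sum to the 138,008 images stated above. The same `extract_archive` helper can be pointed at `Annotation.tar.gz`.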
### Annotation Format
Both `train.jsonl` and `test.jsonl` follow a unified format with four fields: `image_path`, `type`, `question`, and `answer`. The `answer` field wraps the ground truth in `<answer></answer>` tags.
**1. Perception Layer**
```json
{
"image_path": "Images/amusement/amusement_019825.jpg",
"type": "Perception",
"question": "What kind of feeling does the image evoke? Please select the emotion closest to the image from the following options: amusement, anger, awe, contentment, disgust, excitement, fear and sadness. Please ensure the result is formatted as follows: <answer></answer>.",
"answer": "<answer>amusement</answer>"
},
{
"image_path": "Images/excitement/excitement_007484.jpg",
"type": "Perception",
"question": "Does the emotional quality of the image feel positive or negative? Please select the emotion closest to the image from the following options: positive, negative. Please ensure the result is formatted as follows: <answer></answer>.",
"answer": "<answer>positive</answer>"
}
```
**2. Understanding Layer**
```json
{
"image_path": "Images/amusement/amusement_016011.jpg",
"type": "Understanding",
"question": "How do the decorations' form and surface jointly build a whimsical look? Please ensure the result is formatted as follows: <answer></answer>.",
"answer": "<answer>The orange star-shaped decorations and their smooth glossy surface combine for a bright, playful visual configuration.</answer>"
}
```
**3. Cognition Layer**
```json
{
"image_path": "Images/awe/awe_016410.jpg",
"type": "Cognition",
"question": "What would your immediate response intent be if you encountered this scene? Please select the instinctive reaction that best matches the content of the image from the following options: acknowledge, comfort, encourage, celebrate, practical_help, investigate, deescalate and redirect. Please ensure the result is formatted as follows: <answer></answer>.",
"answer": "<answer>acknowledge</answer>"
},
{
"image_path": "Images/fear/fear_004696.jpg",
"type": "Cognition",
"question": "Describe the unfolding sequence of somatic, semantic, and regulatory responses to this scene. Please ensure the result is formatted as follows: <answer></answer>.",
"answer": "<answer>The light through that open shutter is creating a harsh, uneven feeling. Reach toward the open shutter to pull it closed.</answer>"
}
```
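The records above can be read and unwrapped with a few lines of Python. This is a sketch under two assumptions: that the `.jsonl` files store one JSON object per line (the usual JSON Lines convention, whereas the examples above are pretty-printed for readability), and that each answer contains a single `<answer></answer>` pair. The names `load_jsonl` and `extract_answer` are hypothetical helpers, not functions shipped with the dataset.

```python
import json
import re

# Matches the text between <answer> and </answer>, including across newlines.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(text):
    """Return the content of the first <answer>...</answer> tag, or None."""
    match = ANSWER_RE.search(text)
    return match.group(1).strip() if match else None


def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Example with a ground-truth string taken from the Cognition sample above.
print(extract_answer("<answer>acknowledge</answer>"))
```

Returning `None` for untagged text lets a caller distinguish a malformed model response from an empty answer.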
## Dataset Statistics
InsightVQA is built on **138,008** well-balanced images with high-confidence perception labels. The annotation pipeline yields a total of about **725K** question-answer pairs, structured across the three cognitive layers.
### 1. Annotations by Cognitive Layer
The raw annotations are distributed across the three hierarchical stages to support multi-level reasoning:
| **Cognitive Layer** | **Total QA Pairs** | **Included Tasks** |
| ------------------- | ------------------ | ------------------------------------------------------------ |
| **Perception** | 276K | Emotion classification and valence recognition |
| **Understanding** | 330K | Visual attribution, contextual synthesis, and counterfactual reasoning |
| **Cognition** | 119K | Response intent and situational insight sequences (somatic, semantic, regulatory) |
| **Total** | **725K** | **The complete hierarchical annotation pool** |
### 2. Benchmark Splits
To support standardized training and evaluation, a curated subset of the annotations (684,133 of the roughly 725K QA pairs) is formatted into the final benchmark splits (`train.jsonl` and `test.jsonl`):
| **Split** | **Images** | **QA Pairs** | **Purpose** |
| --------- | ---------- | ------------ | ------------------------------------------------------------ |
| **Train** | 124K | 653,292 | Generative QA format for foundational reasoning training |
| **Test** | 14K | 30,841 | Standardized discriminative tasks (multiple-choice questions, MCQ, and situational judgment tests, SJT) for fine-grained evaluation |
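The figures in these tables can be checked against a downloaded split by counting records, distinct `image_path` values, and records per `type`. The sketch below assumes the records are already loaded as a list of dictionaries with the fields shown in the Annotation Format section. `summarize_split` is a hypothetical helper, and the three sample records use made-up file names purely for illustration.

```python
from collections import Counter


def summarize_split(records):
    """Count QA pairs, distinct images, and QA pairs per cognitive layer."""
    records = list(records)
    return {
        "qa_pairs": len(records),
        "images": len({r["image_path"] for r in records}),
        "per_layer": dict(Counter(r["type"] for r in records)),
    }


# Made-up in-memory records; real ones would come from train.jsonl or test.jsonl.
sample = [
    {"image_path": "Images/awe/example_a.jpg", "type": "Perception"},
    {"image_path": "Images/awe/example_a.jpg", "type": "Understanding"},
    {"image_path": "Images/fear/example_b.jpg", "type": "Cognition"},
]
print(summarize_split(sample))
```

Run on the full files, `qa_pairs` would be expected to match the 653,292 and 30,841 figures in the table.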
## Three-Tier Cognitive Architecture
InsightVQA formulates human-centered visual understanding through three progressively deeper stages:
### 1. Perception
Serves as the entry-level stage. It evaluates the model's ability to identify basic emotional states and valence from visual inputs.
### 2. Understanding
Elucidates *why* an emotion is perceived by grounding the reasoning process in verifiable visual evidence.
- **Visual Triggers:** Models must utilize appearance cues, scene cues, and agent cues.
- **Reasoning Types:** Includes visual attribution, contextual synthesis, and counterfactual reasoning.
### 3. Cognition
Focuses on higher-order cognitive reasoning and grounded response planning.
- **Response Intent:** Predicting the instinctive intent if encountering the scene.
- **Insight Sequences:** Evaluating the natural unfolding of somatic, semantic, and regulatory responses in a Situational Judgment Test format.
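For the closed-set questions in these tiers (emotion label, valence, response intent), one simple way to score a model is exact-match accuracy per layer. The card does not specify an official scoring protocol, so the sketch below is an assumption: it compares the tagged answer strings case-insensitively, and `accuracy_by_layer` is a hypothetical helper. Free-form answers, such as those in the Understanding layer, would need a different metric.

```python
import re
from collections import defaultdict

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize(text):
    """Extract the tagged answer if present, then trim and lowercase it."""
    match = ANSWER_RE.search(text)
    inner = match.group(1) if match else text
    return inner.strip().lower()


def accuracy_by_layer(records, predictions):
    """Exact-match accuracy per `type`; records and predictions are aligned lists."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record, prediction in zip(records, predictions):
        layer = record["type"]
        total[layer] += 1
        if normalize(prediction) == normalize(record["answer"]):
            correct[layer] += 1
    return {layer: correct[layer] / total[layer] for layer in total}


# Ground truths follow the card's answer format; predictions are invented.
records = [
    {"type": "Perception", "answer": "<answer>amusement</answer>"},
    {"type": "Perception", "answer": "<answer>positive</answer>"},
    {"type": "Cognition", "answer": "<answer>acknowledge</answer>"},
]
predictions = ["<answer>Amusement</answer>", "<answer>negative</answer>", "acknowledge"]
print(accuracy_by_layer(records, predictions))
```

Accepting untagged predictions is a lenient design choice. A stricter scorer could count any response without `<answer>` tags as wrong, since the questions explicitly request that format.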
## Applications & Use Cases
- **Multimodal Large Language Models:** Benchmarking advanced reasoning capabilities, contextual dependencies, and cognitive-affective interactions.
- **Affective Computing:** Developing AI systems capable of deep emotion understanding for human-computer interaction and socially assistive robotics.
- **Visual Grounding:** Testing a model's ability to tie abstract emotional states to concrete, observable visual evidence.
Provider: ziyul707



