mlinhbng/viet-cultural-vqa
Source: Hugging Face · Updated 2025-12-08 · Indexed 2025-12-20
Download link: https://hf-mirror.com/datasets/mlinhbng/viet-cultural-vqa
Description:
---
language:
- vi
- en
license: apache-2.0
task_categories:
- visual-question-answering
- image-classification
- object-detection
task_ids:
- visual-question-answering
pretty_name: Vietnamese Cultural VQA Dataset
size_categories:
- 10K<n<100K
tags:
- vietnamese
- cultural-heritage
- visual-question-answering
- multimodal
- cultural-understanding
- traditional-culture
- southeast-asian
configs:
- config_name: default
  data_files:
  - split: train
    path: splits/train_data.json
  - split: validation
    path: splits/val_data.json
  - split: test
    path: splits/test_data.json
---
# 🇻🇳 Vietnamese Cultural VQA Dataset





## 📖 Dataset Description
The **Vietnamese Cultural VQA Dataset** is a multimodal dataset for Visual Question Answering (VQA) on Vietnamese cultural heritage. It pairs images from 12 cultural domains with question-answer pairs, and each question carries a detailed explanation, a note on cultural significance, a difficulty rating, a question type, and a cognitive level.
### 🎯 Dataset Summary
- **📊 Total Images**: 28,505 cultural images (the Data Splits table lists 25,074 samples)
- **💬 Total QA Pairs**: 119,012 question-answer pairs
- **🌍 Languages**: Vietnamese (primary), English (secondary)
- **🏛️ Categories**: 12 major Vietnamese cultural domains
- **📜 License**: Apache 2.0
- **📁 Format**: JSON with image references
- **🎓 Quality**: 97.5% high-quality annotations
### 🚀 Supported Tasks
- **Visual Question Answering (VQA)**: Answer questions about Vietnamese cultural images
- **Image Classification**: Classify images into cultural categories
- **Object Detection**: Identify cultural objects and elements
- **Cultural Understanding**: Learn about Vietnamese traditions, customs, and heritage
- **Multimodal Learning**: Combine vision and language for cultural comprehension
- **Cross-lingual Transfer**: Vietnamese-English multimodal understanding
---
## 📂 Dataset Structure
### 💾 Data Instances
Each instance in the dataset contains rich annotations:
```json
{
  "image_id": "kien_truc_chua_mot_cot_000001",
  "image_path": "images/kien_truc/chua_mot_cot/000001.jpg",
  "category": "kien_truc",
  "keyword": "chùa một cột",
  "image_analysis": {
    "overall_description": "Hình ảnh chùa Một Cột, kiến trúc Phật giáo độc đáo...",
    "main_objects": ["chùa", "cột đá", "mái cong", "hồ nước"],
    "visual_details": {
      "colors": ["nâu gỗ", "xanh rêu", "vàng", "xanh nước"],
      "materials": ["gỗ", "đá", "ngói", "nước"],
      "composition": "Trung tâm là chùa trên cột đá giữa hồ sen",
      "setting": "Môi trường văn hóa lịch sử, Hà Nội",
      "cultural_identification": "Kiến trúc Phật giáo Việt Nam thời Lý"
    }
  },
  "cultural_context": {
    "primary_cultural_objects": ["chùa Một Cột", "kiến trúc Lý"],
    "cultural_category": "Kiến trúc tôn giáo",
    "regional_significance": "Hà Nội, Bắc Bộ Việt Nam",
    "historical_context": "Xây dựng năm 1049 dưới triều vua Lý Thái Tông...",
    "modern_relevance": "Biểu tượng văn hóa Hà Nội, di sản quốc gia"
  },
  "questions": [
    {
      "question_id": 1,
      "question": "Đây là công trình kiến trúc nào?",
      "answer": "Chùa Một Cột",
      "detailed_explanation": "Chùa Một Cột là một trong những công trình kiến trúc độc đáo nhất...",
      "cultural_significance": "Biểu tượng văn hóa Việt Nam, di sản kiến trúc thời Lý",
      "difficulty": "easy",
      "question_type": "identification",
      "cognitive_level": "remember",
      "additional_context": {
        "origin": "Triều đại Lý, năm 1049",
        "usage": "Nơi thờ Phật, điểm tham quan văn hóa",
        "symbolism": "Hoa sen nở trên mặt nước - biểu tượng thanh tịnh",
        "regional_variations": "Độc nhất tại Hà Nội"
      }
    }
  ]
}
```
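Because each instance nests several questions under one image, training code often needs one row per question. The sketch below shows one way to flatten an instance, assuming the field names in the example above. `flatten_instance` and the `row_id` format are hypothetical helpers for illustration, not part of the dataset tooling.

```python
def flatten_instance(instance):
    """Turn one nested instance into a list of per-question rows."""
    rows = []
    for q in instance.get("questions", []):
        rows.append({
            # Combine image and question ids so every row has a unique key
            "row_id": f'{instance["image_id"]}_q{q["question_id"]}',
            "image_path": instance["image_path"],
            "category": instance["category"],
            "question": q["question"],
            "answer": q["answer"],
            "difficulty": q.get("difficulty"),
        })
    return rows


# Trimmed copy of the documented instance, used only for illustration
example = {
    "image_id": "kien_truc_chua_mot_cot_000001",
    "image_path": "images/kien_truc/chua_mot_cot/000001.jpg",
    "category": "kien_truc",
    "questions": [
        {"question_id": 1, "question": "Đây là công trình kiến trúc nào?",
         "answer": "Chùa Một Cột", "difficulty": "easy"},
    ],
}

rows = flatten_instance(example)
print(rows[0]["row_id"], "->", rows[0]["answer"])
```

Keeping `image_path` on every row means the image can be loaded lazily, at the cost of repeating the path once per question.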
### 🔑 Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `image_id` | string | Unique identifier for each image |
| `image` | Image | The image file (PIL Image object) |
| `image_path` | string | Relative path to the image |
| `category` | ClassLabel | One of 12 cultural categories |
| `keyword` | string | Primary cultural keyword/object |
| **image_analysis** | dict | Detailed image analysis |
| ├─ `overall_description` | string | Comprehensive image description |
| ├─ `main_objects` | list[string] | Key objects in the image |
| └─ `visual_details` | dict | Colors, materials, composition, setting, cultural ID |
| **cultural_context** | dict | Cultural background information |
| ├─ `primary_cultural_objects` | list[string] | Main cultural elements |
| ├─ `cultural_category` | string | Subcategory classification |
| ├─ `regional_significance` | string | Geographic/regional context |
| ├─ `historical_context` | string | Historical background |
| └─ `modern_relevance` | string | Contemporary significance |
| **questions** | list[dict] | List of Q&A pairs |
| ├─ `question_id` | int | Question identifier |
| ├─ `question` | string | The question text |
| ├─ `answer` | string | The answer text |
| ├─ `detailed_explanation` | string | Comprehensive explanation |
| ├─ `cultural_significance` | string | Cultural importance |
| ├─ `difficulty` | string | easy, medium, or hard |
| ├─ `question_type` | string | identification, description, cultural, analysis, comparison |
| ├─ `cognitive_level` | string | remember, understand, apply, analyze, evaluate (Bloom's Taxonomy) |
| └─ `additional_context` | dict | origin, usage, symbolism, regional_variations |
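
The enumerated values in the table lend themselves to a quick sanity check before training. The following is a minimal sketch, assuming the allowed values are exactly those listed above. `validate_question` is a hypothetical helper and is not claimed to be the validation used by the dataset authors.

```python
# Allowed values copied from the Data Fields table
ALLOWED = {
    "difficulty": {"easy", "medium", "hard"},
    "question_type": {"identification", "description", "cultural", "analysis", "comparison"},
    "cognitive_level": {"remember", "understand", "apply", "analyze", "evaluate"},
}
REQUIRED_TEXT = ("question", "answer")


def validate_question(q):
    """Return a list of human-readable problems found in one question dict."""
    problems = []
    for field in REQUIRED_TEXT:
        # A missing key, a non-string value, or blank text all count as a problem
        if not isinstance(q.get(field), str) or not q[field].strip():
            problems.append(f"missing or empty '{field}'")
    for field, allowed in ALLOWED.items():
        if q.get(field) not in allowed:
            problems.append(f"unexpected {field}: {q.get(field)!r}")
    return problems


good = {"question": "Đây là công trình kiến trúc nào?", "answer": "Chùa Một Cột",
        "difficulty": "easy", "question_type": "identification",
        "cognitive_level": "remember"}
bad = dict(good, difficulty="trivial", answer="")

print(validate_question(good))
print(validate_question(bad))
```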
### 📊 Data Splits
| Split | Samples | QA Pairs | Percentage | Size |
|-------|---------|----------|------------|------|
| **Train** | 18,806 | ~89,400 | 75% | ~195 MB |
| **Validation** | 3,761 | ~17,900 | 15% | ~24 MB |
| **Test** | 2,507 | ~11,900 | 10% | ~25 MB |
| **Total** | **25,074** | **119,012** | 100% | **~244 MB** |
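
The percentages in the table, and the figure of about 4.75 questions per image, can be reproduced from the sample counts. A short worked check, using only numbers copied from this card:

```python
# Sample counts copied from the Data Splits table
splits = {"train": 18_806, "validation": 3_761, "test": 2_507}
total_qa_pairs = 119_012

total_samples = sum(splits.values())
shares = {name: round(100 * n / total_samples, 1) for name, n in splits.items()}
avg_qa_per_sample = round(total_qa_pairs / total_samples, 2)

print(total_samples)      # 25074
print(shares)             # {'train': 75.0, 'validation': 15.0, 'test': 10.0}
print(avg_qa_per_sample)  # 4.75
```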
---
## 🏛️ Dataset Categories
The dataset covers **12 major Vietnamese cultural domains**:
| # | Category | Vietnamese Name | Description | Images | Keywords |
|---|----------|----------------|-------------|--------|----------|
| 1 | **Architecture** | Kiến trúc | Temples, pagodas, traditional houses, palaces | 2,979 | chùa, đền, nhà rường, lăng |
| 2 | **Cuisine** | Ẩm thực | Traditional dishes, street food, ingredients | ~2,500 | phở, bánh mì, bún, chả |
| 3 | **Landscapes** | Phong cảnh | Natural heritage, scenic spots, landmarks | 2,929 | Hạ Long, Sapa, đồng ruộng |
| 4 | **Clothing** | Trang phục | Áo dài, ethnic costumes, traditional attire | 2,485 | áo dài, áo tứ thân, trang phục dân tộc |
| 5 | **Daily Life** | Đời sống hàng ngày | Markets, street scenes, everyday activities | 2,493 | chợ, phố cổ, sinh hoạt |
| 6 | **Folk Culture** | Văn hóa dân gian | Water puppetry, folk arts, traditional performances | 1,969 | múa rối nước, hát chèo, ca trù |
| 7 | **Festivals** | Lễ hội | Traditional celebrations, ceremonies, rituals | 2,387 | Tết, lễ hội đền, rước kiệu |
| 8 | **Traditional Games** | Trò chơi dân gian | Folk games, children's games | 2,469 | đánh đu, kéo co, ô ăn quan |
| 9 | **Traditional Sports** | Thể thao truyền thống | Martial arts, traditional sports | 2,439 | võ cổ truyền, đua thuyền |
| 10 | **Handicrafts** | Thủ công mỹ nghệ | Ceramics, lacquerware, silk, bamboo crafts | 1,986 | gốm sứ, sơn mài, tơ tằm |
| 11 | **Musical Instruments** | Nhạc cụ | Traditional Vietnamese instruments | 1,453 | đàn tranh, đàn bầu, sáo trúc |
| 12 | **Transportation** | Giao thông | Cyclos, sampans, traditional vehicles | 1,485 | xích lô, thuyền, ghe |
---
## 📈 Dataset Statistics
### 🎯 Question Analysis
**Difficulty Distribution:**
- 🟢 **Easy**: 25,162 (21.1%) - Basic identification and recognition
- 🟡 **Medium**: 46,441 (39.0%) - Description and understanding
- 🔴 **Hard**: 47,409 (39.8%) - Analysis and cultural insight
**Question Types:**
- 🔍 **Identification**: 24,892 (20.9%) - "What is this?"
- 📝 **Description**: 22,252 (18.7%) - "Describe the image"
- 🏛️ **Cultural**: 23,969 (20.1%) - "What is the cultural significance?"
- 🧠 **Analysis**: 23,982 (20.1%) - "Why is this important?"
- ⚖️ **Comparison**: 23,889 (20.1%) - "How does this compare?"
**Cognitive Levels (Bloom's Taxonomy):**
- 💭 **Remember**: 24,842 (20.9%) - Recall facts
- 🧩 **Understand**: 25,794 (21.7%) - Explain concepts
- 🛠️ **Apply**: 19,747 (16.6%) - Use knowledge
- 🔬 **Analyze**: 26,564 (22.3%) - Break down info
- ⭐ **Evaluate**: 22,018 (18.5%) - Make judgments
### ✅ Quality Metrics
- **High Quality Annotations**: 24,446 samples (97.5%)
- **AI-Assisted Annotations**: 628 samples (2.5%)
- **Average Explanation Length**: 295 characters
- **Average Questions per Image**: 4.75
- **Cultural Expert Validation**: Yes
---
## 💻 Usage
### 🔧 Installation
```bash
pip install datasets pillow
```
### 📥 Load the Dataset
```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("Dangindev/viet-cultural-vqa")
# Load specific splits
train_data = load_dataset("Dangindev/viet-cultural-vqa", split="train")
val_data = load_dataset("Dangindev/viet-cultural-vqa", split="validation")
test_data = load_dataset("Dangindev/viet-cultural-vqa", split="test")
# Access a sample
sample = dataset["train"][0]
print(f"Image ID: {sample['image_id']}")
print(f"Category: {sample['category']}")
print(f"Question: {sample['questions'][0]['question']}")
print(f"Answer: {sample['questions'][0]['answer']}")
# Display image
sample['image'].show()
```
### 🔍 Filtering by Category
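```python
# Filter architecture images.
# Resolve the label by name instead of hard-coding an index; fall back to the
# raw string if `category` is stored as plain text rather than a ClassLabel.
category_feature = dataset["train"].features["category"]
kien_truc = (
    category_feature.str2int("kien_truc")
    if hasattr(category_feature, "str2int")
    else "kien_truc"
)
architecture = dataset["train"].filter(lambda x: x["category"] == kien_truc)

# Filter by difficulty (keeps images with at least one hard question)
hard_questions = dataset["train"].filter(
    lambda x: any(q["difficulty"] == "hard" for q in x["questions"])
)

# Filter by question type (keeps images with at least one cultural question)
cultural_questions = dataset["train"].filter(
    lambda x: any(q["question_type"] == "cultural" for q in x["questions"])
)
```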
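The difficulty and question-type filters above keep whole images, including any non-matching questions attached to them. When only the matching questions are wanted, the question list itself can be trimmed. This sketch works on plain dictionaries with the documented structure. `keep_matching_questions` is a hypothetical helper and the record is made up for illustration.

```python
def keep_matching_questions(record, **criteria):
    """Return a copy of `record` whose question list only holds matching questions.

    Returns None when no question matches, so callers can drop the record.
    """
    matching = [
        q for q in record["questions"]
        if all(q.get(key) == value for key, value in criteria.items())
    ]
    if not matching:
        return None
    return {**record, "questions": matching}


record = {
    "image_id": "demo_000001",  # made-up id for illustration
    "questions": [
        {"question_id": 1, "difficulty": "easy", "question_type": "identification"},
        {"question_id": 2, "difficulty": "hard", "question_type": "cultural"},
        {"question_id": 3, "difficulty": "hard", "question_type": "analysis"},
    ],
}

hard_only = keep_matching_questions(record, difficulty="hard")
hard_cultural = keep_matching_questions(record, difficulty="hard", question_type="cultural")
print([q["question_id"] for q in hard_only["questions"]])      # [2, 3]
print([q["question_id"] for q in hard_cultural["questions"]])  # [2]
```

The original record is left unchanged, so the same record can be trimmed under several criteria.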
### 🤖 Training a VQA Model
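```python
from transformers import ViltProcessor, ViltForQuestionAnswering
from torch.utils.data import DataLoader
import torch

train = dataset["train"]

# ViLT treats VQA as classification over a fixed answer set, so build an
# answer vocabulary from the training answers.
answers = sorted({q["answer"] for qs in train["questions"] for q in qs})
answer2id = {answer: i for i, answer in enumerate(answers)}

# Load model and processor. The checkpoint's classifier head targets the VQAv2
# answer set, so it is re-initialized to match the new vocabulary size.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained(
    "dandelin/vilt-b32-finetuned-vqa",
    num_labels=len(answers),
    ignore_mismatched_sizes=True,
)

# Prepare dataset: one (image index, question, answer id) triple per QA pair
pairs = [
    (i, q["question"], answer2id[q["answer"]])
    for i, qs in enumerate(train["questions"])
    for q in qs
]

def collate_fn(batch):
    # Images are decoded here, per batch, so the processor can pad them together
    images = [train[i]["image"].convert("RGB") for i, _, _ in batch]
    questions = [question for _, question, _ in batch]
    encoding = processor(images, questions, padding=True, truncation=True, return_tensors="pt")
    # ViLT expects float targets of shape (batch_size, num_labels)
    labels = torch.zeros(len(batch), len(answers))
    for row, (_, _, answer_id) in enumerate(batch):
        labels[row, answer_id] = 1.0
    encoding["labels"] = labels
    return encoding

# Create dataloader
train_dataloader = DataLoader(pairs, batch_size=8, shuffle=True, collate_fn=collate_fn)

# Training loop (simplified: single pass, no device placement or evaluation)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for batch in train_dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```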
### 📊 Data Analysis
```python
import pandas as pd

train = dataset["train"]

# Analyze category distribution (column access avoids decoding every image)
categories = list(train["category"])
print(pd.Series(categories).value_counts())

# Analyze question difficulty
difficulties = [q["difficulty"] for qs in train["questions"] for q in qs]
print(pd.Series(difficulties).value_counts())

# Average questions per image
avg_questions = len(difficulties) / len(train)
print(f"Average questions per image: {avg_questions:.2f}")
```
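The counts above look at one field at a time. A cross-tabulation of difficulty by category can show whether hard questions cluster in particular domains. Below is a standard-library sketch over plain records. `difficulty_by_category` is a hypothetical helper, the records are made up, and `am_thuc` is an assumed category id (only `kien_truc` appears in this card).

```python
from collections import Counter


def difficulty_by_category(records):
    """Count questions per (category, difficulty) pair."""
    counts = Counter()
    for record in records:
        for q in record["questions"]:
            counts[(record["category"], q["difficulty"])] += 1
    return counts


# Tiny made-up records that mirror the documented structure
records = [
    {"category": "kien_truc", "questions": [{"difficulty": "easy"}, {"difficulty": "hard"}]},
    {"category": "am_thuc", "questions": [{"difficulty": "hard"}]},
    {"category": "kien_truc", "questions": [{"difficulty": "hard"}]},
]

table = difficulty_by_category(records)
for (category, difficulty), n in sorted(table.items()):
    print(f"{category:10s} {difficulty:6s} {n}")
```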
---
## 🛠️ Data Collection and Annotation
### 📸 Image Collection
Images were collected from multiple sources:
- ✅ Public domain Vietnamese cultural archives
- ✅ Creative Commons licensed photographs
- ✅ Curated web crawling with cultural keywords
- ✅ Collaborative contributions from cultural experts
- ✅ Vietnamese tourism and heritage websites
### ✍️ Annotation Process
1. **Image Analysis** (Automated)
   - Google Gemini Vision API for initial analysis
   - Object detection and scene understanding
2. **Cultural Context** (Expert-guided)
   - Vietnamese cultural experts review and enrich annotations
   - Historical and regional context added
3. **Question Generation** (AI + Human)
   - AI-assisted question generation with templates
   - Human review and refinement
   - Multiple cognitive levels (Bloom's Taxonomy)
4. **Quality Control** (Multi-stage)
   - Automated validation checks
   - Expert review of samples
   - Community feedback integration
5. **Cultural Verification**
   - Review by Vietnamese cultural experts
   - Regional variations documented
   - Historical accuracy ensured
### 📋 Annotation Guidelines
- ✅ Questions cover multiple cognitive levels
- ✅ Answers include detailed cultural explanations
- ✅ Focus on authenticity and cultural accuracy
- ✅ Bilingual support (Vietnamese primary)
- ✅ Regional diversity representation
- ✅ Respect for cultural sensitivity
---
## 🤝 Ethical Considerations
### 🌏 Cultural Sensitivity
- All images and annotations respect Vietnamese cultural heritage
- Traditional knowledge presented with appropriate context
- Regional variations acknowledged and documented
- No stereotyping or cultural appropriation
- Consultation with Vietnamese cultural experts
### 🔒 Privacy
- No personal identifying information in images
- Public spaces and cultural artifacts only
- Consent obtained where applicable
- No sensitive or private cultural practices
### ⚖️ Bias Mitigation
- Balanced representation across regions (North, Central, South Vietnam)
- Diverse cultural categories to avoid stereotyping
- Multiple perspectives on cultural practices
- Gender and age diversity in depicted subjects
- Urban and rural representation
---
## ⚠️ Limitations
- **Geographic Coverage**: Some remote regions may be underrepresented
- **Historical Depth**: Focus on contemporary and recent culture (post-20th century)
- **Language**: Primary content in Vietnamese; English translations may vary in quality
- **Automation**: Some annotations generated by AI and may contain minor errors
- **Cultural Nuance**: Complex cultural concepts may be simplified for accessibility
- **Image Quality**: Varies based on source (mostly high quality, some moderate)
- **Temporal Coverage**: Modern images; historical period images limited
---
## 📚 Citation
If you use this dataset in your research, please cite:
```bibtex
@misc{VietMEAgent,
title={VietMEAgent: Culturally-Aware Few-Shot Multimodal Explanation for Vietnamese Visual Question Answering},
author={Hai-Dang Nguyen and Minh-Anh Dang and Minh-Tan Le and Minh-Tuan Le},
year={2025},
eprint={2511.09058},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.09058},
}
```
---
## 📄 License
This dataset is licensed under the **Apache License 2.0**.
✅ **You are free to:**
- Use: use the material for any purpose, including commercial use
- Share: copy and redistribute the material
- Adapt: remix, transform, and build upon the material
⚠️ **Under the following terms:**
- License copy: include a copy of the Apache License 2.0 with any redistribution
- Notices: retain existing copyright, patent, trademark, and attribution notices
- Changes: mark modified files with a prominent notice stating that you changed them
See [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for full details.
---
## 👥 Dataset Maintainers
- **Team**: VietMEAgent Team
- **Contact**: [Dangindev on Hugging Face](https://huggingface.co/Dangindev)
- **Repository**: https://huggingface.co/datasets/Dangindev/viet-cultural-vqa
- **Issues**: Please report issues on the repository
---
## 📝 Changelog
### Version 1.0.0 (October 2024)
- ✨ Initial release
- 📊 28,505 images across 12 cultural categories
- 💬 119,012 question-answer pairs
- 🏛️ Multi-level annotations with rich cultural context
- 📂 Train/validation/test splits (75/15/10)
- 🔧 HuggingFace datasets integration
- 📖 Comprehensive documentation
---
## 🙏 Acknowledgments
We thank:
- 🇻🇳 Vietnamese cultural experts for validation and guidance
- 🌐 Open-source community for tools and frameworks
- 🤗 Hugging Face for hosting and infrastructure
- 👥 Contributors who helped curate and validate the dataset
- 🏛️ Vietnamese heritage organizations for support
- 📚 Academic institutions for collaboration
---
## 🔮 Future Work
- 🌟 Expand to more granular subcategories
- ⏳ Add temporal evolution tracking (historical changes)
- 🔊 Include audio descriptions for accessibility
- 🌍 Multilingual expansion (French, Chinese, Japanese)
- 🤝 Interactive annotation tool for community contributions
- 📹 Video annotations for dynamic cultural practices
- 🗺️ Geographic metadata and mapping
- 🎓 Educational curriculum integration
---
## 🏷️ Keywords
`Vietnamese culture` • `Visual Question Answering` • `Multimodal Learning` • `Cultural Heritage` • `Traditional Culture` • `Southeast Asian AI` • `Cultural Understanding` • `VQA Dataset` • `Image Classification` • `Vietnamese Language` • `Cultural Preservation` • `AI for Heritage` • `Multimodal Dataset` • `Computer Vision` • `Natural Language Processing`
---
**⭐ If you find this dataset useful, please give it a star and cite it in your work!**
Provider: mlinhbng



