mlinhbng/viet-cultural-vqa

Name: mlinhbng/viet-cultural-vqa
Creator: mlinhbng
Published: 2025-12-08 09:00:27
License: No description available

Hugging Face · Updated 2025-12-08 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/mlinhbng/viet-cultural-vqa

Description:

---
language:
- vi
- en
license: apache-2.0
task_categories:
- visual-question-answering
- image-classification
- object-detection
task_ids:
- visual-question-answering
pretty_name: Vietnamese Cultural VQA Dataset
size_categories:
- 10K<n<100K
tags:
- vietnamese
- cultural-heritage
- visual-question-answering
- multimodal
- cultural-understanding
- traditional-culture
- southeast-asian
configs:
- config_name: default
  data_files:
  - split: train
    path: splits/train_data.json
  - split: validation
    path: splits/val_data.json
  - split: test
    path: splits/test_data.json
---

# 🇻🇳 Vietnamese Cultural VQA Dataset

![Dataset Banner](https://img.shields.io/badge/Dataset-Vietnamese_Cultural_VQA-blue) ![License](https://img.shields.io/badge/License-Apache%202.0-green) ![Images](https://img.shields.io/badge/Images-28.5K-orange) ![QA Pairs](https://img.shields.io/badge/QA_Pairs-119K-red) ![Language](https://img.shields.io/badge/Language-Vietnamese-yellow)

## 📖 Dataset Description

The **Vietnamese Cultural VQA Dataset** is a comprehensive multimodal dataset designed for Visual Question Answering (VQA) tasks focused on Vietnamese cultural heritage. This dataset aims to bridge the gap in understanding and preserving Vietnamese culture through AI-powered visual understanding and question answering.
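The `configs` entry in the metadata header maps each split to a JSON file under `splits/`. The following is a minimal sketch of reading one of those files directly, assuming a local copy of the repository and assuming each file holds a JSON array of records (the card does not confirm the on-disk layout). The helper name `load_split` is hypothetical.

```python
import json
from pathlib import Path

# Split-to-file mapping copied from the `configs` block in the metadata header.
SPLIT_FILES = {
    "train": "splits/train_data.json",
    "validation": "splits/val_data.json",
    "test": "splits/test_data.json",
}


def load_split(repo_root, split):
    """Read one split file from a local copy of the repository.

    Assumption: each split file is a JSON array of record objects.
    """
    path = Path(repo_root) / SPLIT_FILES[split]
    with path.open(encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"{path} does not contain a JSON array")
    return records
```

Reading the files this way skips the `datasets` library entirely, which can be convenient for quick inspection; the image files referenced by `image_path` still have to be opened separately.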
### 🎯 Dataset Summary

- **📊 Total Images**: 28,505 high-quality cultural images
- **💬 Total QA Pairs**: 119,012 question-answer pairs
- **🌍 Languages**: Vietnamese (primary), English (secondary)
- **🏛️ Categories**: 12 major Vietnamese cultural domains
- **📜 License**: Apache 2.0
- **📁 Format**: JSON with image references
- **🎓 Quality**: 97.5% high-quality annotations

### 🚀 Supported Tasks

- **Visual Question Answering (VQA)**: Answer questions about Vietnamese cultural images
- **Image Classification**: Classify images into cultural categories
- **Object Detection**: Identify cultural objects and elements
- **Cultural Understanding**: Learn about Vietnamese traditions, customs, and heritage
- **Multimodal Learning**: Combine vision and language for cultural comprehension
- **Cross-lingual Transfer**: Vietnamese-English multimodal understanding

---

## 📂 Dataset Structure

### 💾 Data Instances

Each instance in the dataset contains rich annotations:

```json
{
  "image_id": "kien_truc_chua_mot_cot_000001",
  "image_path": "images/kien_truc/chua_mot_cot/000001.jpg",
  "category": "kien_truc",
  "keyword": "chùa một cột",
  "image_analysis": {
    "overall_description": "Hình ảnh chùa Một Cột, kiến trúc Phật giáo độc đáo...",
    "main_objects": ["chùa", "cột đá", "mái cong", "hồ nước"],
    "visual_details": {
      "colors": ["nâu gỗ", "xanh rêu", "vàng", "xanh nước"],
      "materials": ["gỗ", "đá", "ngói", "nước"],
      "composition": "Trung tâm là chùa trên cột đá giữa hồ sen",
      "setting": "Môi trường văn hóa lịch sử, Hà Nội",
      "cultural_identification": "Kiến trúc Phật giáo Việt Nam thời Lý"
    }
  },
  "cultural_context": {
    "primary_cultural_objects": ["chùa Một Cột", "kiến trúc Lý"],
    "cultural_category": "Kiến trúc tôn giáo",
    "regional_significance": "Hà Nội, Bắc Bộ Việt Nam",
    "historical_context": "Xây dựng năm 1049 dưới triều vua Lý Thái Tông...",
    "modern_relevance": "Biểu tượng văn hóa Hà Nội, di sản quốc gia"
  },
  "questions": [
    {
      "question_id": 1,
      "question": "Đây là công trình kiến trúc nào?",
      "answer": "Chùa Một Cột",
      "detailed_explanation": "Chùa Một Cột là một trong những công trình kiến trúc độc đáo nhất...",
      "cultural_significance": "Biểu tượng văn hóa Việt Nam, di sản kiến trúc thời Lý",
      "difficulty": "easy",
      "question_type": "identification",
      "cognitive_level": "remember",
      "additional_context": {
        "origin": "Triều đại Lý, năm 1049",
        "usage": "Nơi thờ Phật, điểm tham quan văn hóa",
        "symbolism": "Hoa sen nở trên mặt nước - biểu tượng thanh tịnh",
        "regional_variations": "Độc nhất tại Hà Nội"
      }
    }
  ]
}
```

### 🔑 Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `image_id` | string | Unique identifier for each image |
| `image` | Image | The image file (PIL Image object) |
| `image_path` | string | Relative path to the image |
| `category` | ClassLabel | One of 12 cultural categories |
| `keyword` | string | Primary cultural keyword/object |
| **image_analysis** | dict | Detailed image analysis |
| ├─ `overall_description` | string | Comprehensive image description |
| ├─ `main_objects` | list[string] | Key objects in the image |
| └─ `visual_details` | dict | Colors, materials, composition, setting, cultural ID |
| **cultural_context** | dict | Cultural background information |
| ├─ `primary_cultural_objects` | list[string] | Main cultural elements |
| ├─ `cultural_category` | string | Subcategory classification |
| ├─ `regional_significance` | string | Geographic/regional context |
| ├─ `historical_context` | string | Historical background |
| └─ `modern_relevance` | string | Contemporary significance |
| **questions** | list[dict] | List of Q&A pairs |
| ├─ `question_id` | int | Question identifier |
| ├─ `question` | string | The question text |
| ├─ `answer` | string | The answer text |
| ├─ `detailed_explanation` | string | Comprehensive explanation |
| ├─ `cultural_significance` | string | Cultural importance |
| ├─ `difficulty` | string | easy, medium, or hard |
| ├─ `question_type` | string | identification, description, cultural, analysis, comparison |
| ├─ `cognitive_level` | string | remember, understand, apply, analyze, evaluate (Bloom's Taxonomy) |
| └─ `additional_context` | dict | origin, usage, symbolism, regional_variations |

### 📊 Data Splits

| Split | Samples | QA Pairs | Percentage | Size |
|-------|---------|----------|------------|------|
| **Train** | 18,806 | ~89,400 | 75% | ~195 MB |
| **Validation** | 3,761 | ~17,900 | 15% | ~24 MB |
| **Test** | 2,507 | ~11,900 | 10% | ~25 MB |
| **Total** | **25,074** | **119,012** | 100% | **~244 MB** |

---

## 🏛️ Dataset Categories

The dataset covers **12 major Vietnamese cultural domains**:

| # | Category | Vietnamese Name | Description | Images | Keywords |
|---|----------|----------------|-------------|--------|----------|
| 1 | **Architecture** | Kiến trúc | Temples, pagodas, traditional houses, palaces | 2,979 | chùa, đền, nhà rường, lăng |
| 2 | **Cuisine** | Ẩm thực | Traditional dishes, street food, ingredients | ~2,500 | phở, bánh mì, bún, chả |
| 3 | **Landscapes** | Phong cảnh | Natural heritage, scenic spots, landmarks | 2,929 | Hạ Long, Sapa, đồng ruộng |
| 4 | **Clothing** | Trang phục | Áo dài, ethnic costumes, traditional attire | 2,485 | áo dài, áo tứ thân, trang phục dân tộc |
| 5 | **Daily Life** | Đời sống hàng ngày | Markets, street scenes, everyday activities | 2,493 | chợ, phố cổ, sinh hoạt |
| 6 | **Folk Culture** | Văn hóa dân gian | Water puppetry, folk arts, traditional performances | 1,969 | múa rối nước, hát chèo, ca trù |
| 7 | **Festivals** | Lễ hội | Traditional celebrations, ceremonies, rituals | 2,387 | Tết, lễ hội đền, rước kiệu |
| 8 | **Traditional Games** | Trò chơi dân gian | Folk games, children's games | 2,469 | đánh đu, kéo co, ô ăn quan |
| 9 | **Traditional Sports** | Thể thao truyền thống | Martial arts, traditional sports | 2,439 | võ cổ truyền, đua thuyền |
| 10 | **Handicrafts** | Thủ công mỹ nghệ | Ceramics, lacquerware, silk, bamboo crafts | 1,986 | gốm sứ, sơn mài, tơ tằm |
| 11 | **Music** | Nhạc cụ | Traditional Vietnamese instruments | 1,453 | đàn tranh, đàn bầu, sáo trúc |
| 12 | **Transportation** | Giao thông | Cyclos, sampans, traditional vehicles | 1,485 | xích lô, thuyền, ghe |

---

## 📈 Dataset Statistics

### 🎯 Question Analysis

**Difficulty Distribution:**
- 🟢 **Easy**: 25,162 (21.1%) - Basic identification and recognition
- 🟡 **Medium**: 46,441 (39.0%) - Description and understanding
- 🔴 **Hard**: 47,409 (39.8%) - Analysis and cultural insight

**Question Types:**
- 🔍 **Identification**: 24,892 (20.9%) - "What is this?"
- 📝 **Description**: 22,252 (18.7%) - "Describe the image"
- 🏛️ **Cultural**: 23,969 (20.1%) - "What is the cultural significance?"
- 🧠 **Analysis**: 23,982 (20.1%) - "Why is this important?"
- ⚖️ **Comparison**: 23,889 (20.1%) - "How does this compare?"

**Cognitive Levels (Bloom's Taxonomy):**
- 💭 **Remember**: 24,842 (20.9%) - Recall facts
- 🧩 **Understand**: 25,794 (21.7%) - Explain concepts
- 🛠️ **Apply**: 19,747 (16.6%) - Use knowledge
- 🔬 **Analyze**: 26,564 (22.3%) - Break down info
- ⭐ **Evaluate**: 22,018 (18.5%) - Make judgments

### ✅ Quality Metrics

- **High Quality Annotations**: 24,446 samples (97.5%)
- **AI-Assisted Annotations**: 628 samples (2.5%)
- **Average Explanation Length**: 295 characters
- **Average Questions per Image**: 4.75
- **Cultural Expert Validation**: Yes

---

## 💻 Usage

### 🔧 Installation

```bash
pip install datasets pillow
```

### 📥 Load the Dataset

```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("Dangindev/viet-cultural-vqa")

# Load specific splits
train_data = load_dataset("Dangindev/viet-cultural-vqa", split="train")
val_data = load_dataset("Dangindev/viet-cultural-vqa", split="validation")
test_data = load_dataset("Dangindev/viet-cultural-vqa", split="test")

# Access a sample
sample = dataset["train"][0]
print(f"Image ID: {sample['image_id']}")
print(f"Category: {sample['category']}")
print(f"Question: {sample['questions'][0]['question']}")
print(f"Answer: {sample['questions'][0]['answer']}")

# Display image
sample['image'].show()
```

### 🔍 Filtering by Category

```python
# Filter architecture images
architecture = dataset["train"].filter(
    lambda x: x["category"] == 1  # kien_truc
)

# Filter by difficulty
hard_questions = dataset["train"].filter(
    lambda x: any(q["difficulty"] == "hard" for q in x["questions"])
)

# Filter by question type
cultural_questions = dataset["train"].filter(
    lambda x: any(q["question_type"] == "cultural" for q in x["questions"])
)
```

### 🤖 Training a VQA Model

```python
from transformers import ViltProcessor, ViltForQuestionAnswering
from torch.utils.data import DataLoader
import torch

# Load model and processor
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Prepare dataset: emit one (image, question) pair per question
def preprocess_function(examples):
    images = []
    questions = []
    for img, qs in zip(examples["image"], examples["questions"]):
        for q in qs:
            images.append(img)
            questions.append(q["question"])

    encoding = processor(images, questions, padding="max_length", truncation=True, return_tensors="pt")
    return encoding

# Process dataset
processed_dataset = dataset["train"].map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names
)
# Return PyTorch tensors so the DataLoader can collate batches
processed_dataset.set_format("torch")

# Create dataloader
train_dataloader = DataLoader(processed_dataset, batch_size=8, shuffle=True)

# Training loop (simplified)
# NOTE: the model only returns a loss when the batch contains a `labels` tensor;
# encoding the answers as labels is omitted from this simplified example.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

for batch in train_dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

### 📊 Data Analysis

```python
import pandas as pd

# Analyze category distribution
categories = [sample["category"] for sample in dataset["train"]]
print(pd.Series(categories).value_counts())

# Analyze question difficulty
difficulties = []
for sample in dataset["train"]:
    for q in sample["questions"]:
        difficulties.append(q["difficulty"])
print(pd.Series(difficulties).value_counts())

# Average questions per image
avg_questions = sum(len(s["questions"]) for s in dataset["train"]) / len(dataset["train"])
print(f"Average questions per image: {avg_questions:.2f}")
```

---

## 🛠️ Data Collection and Annotation

### 📸 Image Collection

Images were collected from multiple sources:
- ✅ Public domain Vietnamese cultural archives
- ✅ Creative Commons licensed photographs
- ✅ Curated web crawling with cultural keywords
- ✅ Collaborative contributions from cultural experts
- ✅ Vietnamese tourism and heritage websites

### ✍️ Annotation Process

1. **Image Analysis** (Automated)
   - Google Gemini Vision API for initial analysis
   - Object detection and scene understanding

2. **Cultural Context** (Expert-guided)
   - Vietnamese cultural experts review and enrich annotations
   - Historical and regional context added

3. **Question Generation** (AI + Human)
   - AI-assisted question generation with templates
   - Human review and refinement
   - Multiple cognitive levels (Bloom's Taxonomy)

4. **Quality Control** (Multi-stage)
   - Automated validation checks
   - Expert review of samples
   - Community feedback integration

5. **Cultural Verification**
   - Review by Vietnamese cultural experts
   - Regional variations documented
   - Historical accuracy ensured

### 📋 Annotation Guidelines

- ✅ Questions cover multiple cognitive levels
- ✅ Answers include detailed cultural explanations
- ✅ Focus on authenticity and cultural accuracy
- ✅ Bilingual support (Vietnamese primary)
- ✅ Regional diversity representation
- ✅ Respect for cultural sensitivity

---

## 🤝 Ethical Considerations

### 🌏 Cultural Sensitivity

- All images and annotations respect Vietnamese cultural heritage
- Traditional knowledge presented with appropriate context
- Regional variations acknowledged and documented
- No stereotyping or cultural appropriation
- Consultation with Vietnamese cultural experts

### 🔒 Privacy

- No personal identifying information in images
- Public spaces and cultural artifacts only
- Consent obtained where applicable
- No sensitive or private cultural practices

### ⚖️ Bias Mitigation

- Balanced representation across regions (North, Central, South Vietnam)
- Diverse cultural categories to avoid stereotyping
- Multiple perspectives on cultural practices
- Gender and age diversity in depicted subjects
- Urban and rural representation

---

## ⚠️ Limitations

- **Geographic Coverage**: Some remote regions may be underrepresented
- **Historical Depth**: Focus on contemporary and recent culture (post-20th century)
- **Language**: Primary content in Vietnamese; English translations may vary in quality
- **Automation**: Some annotations generated by AI and may contain minor errors
- **Cultural Nuance**: Complex cultural concepts may be simplified for accessibility
- **Image Quality**: Varies based on source (mostly high quality, some moderate)
- **Temporal Coverage**: Modern images; historical period images limited

---

## 📚 Citation

If you use this dataset in your research, please cite:

```bibtex
@misc{VietMEAgent,
  title={VietMEAgent: Culturally-Aware Few-Shot Multimodal Explanation for Vietnamese Visual Question Answering},
  author={Hai-Dang Nguyen and Minh-Anh Dang and Minh-Tan Le and Minh-Tuan Le},
  year={2025},
  eprint={2511.09058},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.09058},
}
```

---

## 📄 License

This dataset is licensed under the **Apache License 2.0**.

✅ **You are free to:**
- Share: copy and redistribute the material
- Adapt: remix, transform, and build upon the material
- Commercial use: use the material for commercial purposes

⚠️ **Under the following terms:**
- Attribution: provide appropriate credit and indicate changes
- No additional restrictions: no legal/technological measures that restrict others

See [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for full details.

---

## 👥 Dataset Maintainers

- **Team**: VietMEAgent Team
- **Contact**: [Dangindev on Hugging Face](https://huggingface.co/Dangindev)
- **Repository**: https://huggingface.co/datasets/Dangindev/viet-cultural-vqa
- **Issues**: Please report issues on the repository

---

## 📝 Changelog

### Version 1.0.0 (October 2024)

- ✨ Initial release
- 📊 28,505 images across 12 cultural categories
- 💬 119,012 question-answer pairs
- 🏛️ Multi-level annotations with rich cultural context
- 📂 Train/validation/test splits (75/15/10)
- 🔧 HuggingFace datasets integration
- 📖 Comprehensive documentation

---

## 🙏 Acknowledgments

We thank:
- 🇻🇳 Vietnamese cultural experts for validation and guidance
- 🌐 Open-source community for tools and frameworks
- 🤗 Hugging Face for hosting and infrastructure
- 👥 Contributors who helped curate and validate the dataset
- 🏛️ Vietnamese heritage organizations for support
- 📚 Academic institutions for collaboration

---

## 🔮 Future Work

- 🌟 Expand to more granular subcategories
- ⏳ Add temporal evolution tracking (historical changes)
- 🔊 Include audio descriptions for accessibility
- 🌍 Multilingual expansion (French, Chinese, Japanese)
- 🤝 Interactive annotation tool for community contributions
- 📹 Video annotations for dynamic cultural practices
- 🗺️ Geographic metadata and mapping
- 🎓 Educational curriculum integration

---

## 🏷️ Keywords

`Vietnamese culture` • `Visual Question Answering` • `Multimodal Learning` • `Cultural Heritage` • `Traditional Culture` • `Southeast Asian AI` • `Cultural Understanding` • `VQA Dataset` • `Image Classification` • `Vietnamese Language` • `Cultural Preservation` • `AI for Heritage` • `Multimodal Dataset` • `Computer Vision` • `Natural Language Processing`

---

**⭐ If you find this dataset useful, please give it a star and cite it in your work!**
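Many VQA pipelines expect one question per row, while the records described under Data Instances nest a list of questions under each image. As a minimal sketch of bridging the two, the hypothetical helper `flatten_questions` below turns one nested record into flat rows; it relies only on field names from the Data Fields table, and the `sample` dict is a trimmed copy of the example instance rather than real loaded data.

```python
def flatten_questions(record):
    """Return one flat dict per question in a nested dataset record."""
    rows = []
    for q in record.get("questions", []):
        rows.append({
            "image_id": record["image_id"],
            "category": record["category"],
            "question": q["question"],
            "answer": q["answer"],
            # Optional per-question metadata; None if a record lacks it.
            "difficulty": q.get("difficulty"),
            "question_type": q.get("question_type"),
        })
    return rows


# Trimmed copy of the example instance from the Data Instances section.
sample = {
    "image_id": "kien_truc_chua_mot_cot_000001",
    "category": "kien_truc",
    "questions": [
        {
            "question_id": 1,
            "question": "Đây là công trình kiến trúc nào?",
            "answer": "Chùa Một Cột",
            "difficulty": "easy",
            "question_type": "identification",
        }
    ],
}

rows = flatten_questions(sample)
print(rows[0]["question"], "->", rows[0]["answer"])
```

Keeping `image_id` on every row makes it possible to join the flat rows back to the image and its `cultural_context` later.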

Provided by:

mlinhbng
