DiegoAlysson/Translated_Expanded_CC3M-Brazilian_Portuguese-Hindi-Xhosa
Source: Hugging Face. Updated 2025-12-05; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/DiegoAlysson/Translated_Expanded_CC3M-Brazilian_Portuguese-Hindi-Xhosa
Resource description:
---
license: mit
---
# CC3M Multilingual & Augmented Variants
This repository provides four multilingual, augmented, and similarity-enhanced variants of the **Conceptual Captions 3M (CC3M)** dataset.
The goal is to support research in vision–language modeling, multimodal alignment, data augmentation, and low-resource language evaluation.
All versions include translations generated with **Google Translate** and **MarianMT**. Three of the four use **BLIP2** to generate **five additional captions per image**, while `cc3m_laclip` uses **LaCLIP** instead.
Two versions also include cosine similarity scores, and one of them applies CLIP-based filtering.
---
## Dataset Versions
### **Translated_Expanded_CC3M-Brazilian_Portuguese-Hindi-Xhosa.csv**
### **1. `cc3m_blip2_augment_low_resource`**
A version designed for *low-resource* languages.
- CC3M translated into **Portuguese, Hindi, and Xhosa**
- BLIP2 augmentations translated into **Hindi and Xhosa**
- Includes 1 original caption + 5 augmented captions
- Useful for cross-lingual and low-resource multimodal training
---
### **2. `cc3m_blip2_augment_translated_sim`**
A multilingual, augmented version with similarity metadata.
- CC3M translated into **English and Portuguese**
- 5 BLIP2 augmentations per image
- Cosine similarity scores for:
  - image × original caption
  - image × augmented captions
- Supports multimodal alignment evaluation and curriculum learning
---
### **3. `cc3m_filtered_blip2_augment_translated_sim`**
A quality-filtered version of the dataset.
- Translated into **English and Portuguese**
- Filtered with **CLIP Score ≥ 0.2**
- Includes BLIP2 augmentations and cosine similarity values
- Higher-precision image–text pairs for more robust training
---
### **4. `cc3m_laclip`**
A dataset augmented using **LaCLIP** instead of BLIP2.
- Augmentation exclusively with LaCLIP
- Focused on **Portuguese**
- Includes original caption + LaCLIP-generated captions
- Ideal for studies involving LaCLIP-based captioning
---
### **Translated_Expanded_CC3M-Brazilian_Portuguese-Validation.csv**
### **1. `cc3m_val`**
A multilingual CC3M validation set.
- CC3M validation captions translated into **English and Portuguese**
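
Since this card does not list the column names, a first step with either CSV file is to inspect its header. The sketch below is one way to do that with the Python standard library. It assumes the file has been downloaded locally, has a header row, and is UTF-8 encoded; the helper name `read_header_and_rows` is hypothetical.

```python
import csv


def read_header_and_rows(csv_path, max_rows=5):
    """Return the column names and the first few rows of a CSV file."""
    # Assumption: UTF-8 encoding and a header row; adjust if the file differs.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = []
        for index, row in enumerate(reader):
            if index >= max_rows:
                break
            rows.append(row)
        return reader.fieldnames, rows


# Example call (path assumes the file sits in the working directory):
# columns, preview = read_header_and_rows(
#     "Translated_Expanded_CC3M-Brazilian_Portuguese-Validation.csv"
# )
```

Reading only a few rows keeps the check cheap on a file derived from roughly three million captions.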
---
## Methodology
### **Translations**
All captions were translated using:
- **Google Translate API**
- **MarianMT (Helsinki-NLP)**
This dual-translation setup supports comparative linguistic analysis.
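
One simple way to compare the two translations of a caption is token-level overlap. The sketch below is only an illustration of that idea, not a method used to build the dataset: the Jaccard measure is a swapped-in example metric, and the two Portuguese strings are made-up samples rather than rows from the files.

```python
def token_jaccard(text_a, text_b):
    """Jaccard overlap between the lowercase whitespace tokens of two strings."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical translations of the same caption from the two systems.
google_pt = "um cachorro correndo na praia"
marian_pt = "um cão correndo na praia"
overlap = token_jaccard(google_pt, marian_pt)
```

Low overlap flags captions where the two systems disagree, which may be worth manual review.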
### **Caption Augmentation**
- **BLIP2** generated *five new captions per image*
- **LaCLIP** used in one version (`cc3m_laclip`)
- Augmentations were translated into target languages where relevant (e.g., Hindi, Xhosa)
### **Filtering**
Only the version `cc3m_filtered_blip2_augment_translated_sim` applies filtering:
- **CLIP Score ≥ 0.2**
- Helps remove noisy or mismatched image–caption pairs
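
The threshold step can be sketched as below. This is a minimal illustration under stated assumptions, not the author's pipeline: the field name `clip_score` and the sample rows are hypothetical, and the scores are assumed to be precomputed.

```python
CLIP_SCORE_THRESHOLD = 0.2  # threshold stated in this dataset card


def filter_by_clip_score(rows, threshold=CLIP_SCORE_THRESHOLD, score_key="clip_score"):
    """Keep rows whose precomputed score is at or above the threshold."""
    return [row for row in rows if float(row[score_key]) >= threshold]


# Hypothetical rows; the real column names may differ.
sample_rows = [
    {"caption": "a dog running on a beach", "clip_score": 0.31},
    {"caption": "stock photo placeholder", "clip_score": 0.12},
    {"caption": "a red car parked on a street", "clip_score": 0.20},
]
kept_rows = filter_by_clip_score(sample_rows)
```

The comparison is inclusive, matching the "≥ 0.2" rule, so a pair scoring exactly 0.2 is kept.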
### **Similarity Scores**
Some versions provide cosine similarity values for:
- image × original caption
- image × augmented captions
These metrics are useful for:
- data quality control
- sample reweighting
- multimodal consistency analysis
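
As an illustration of how such scores can be computed and used for sample reweighting, here is a small sketch. The card does not state which encoder produced the embeddings, so the vectors below are toy values, and the normalization scheme in `similarity_weights` is an assumption chosen for simplicity.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors
    return dot / (norm_a * norm_b)


def similarity_weights(similarities):
    """Turn similarities into sampling weights that sum to 1."""
    clipped = [max(score, 0.0) for score in similarities]  # ignore negative scores
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(clipped)] * len(clipped)  # fall back to uniform
    return [score / total for score in clipped]


# Toy embeddings standing in for one image and three candidate captions.
image_embedding = [1.0, 0.0]
caption_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [cosine_similarity(image_embedding, emb) for emb in caption_embeddings]
weights = similarity_weights(scores)
```

Captions that align better with the image receive larger weights, so they are sampled more often during training.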
---
## Comparison Table
| Feature / Dataset Version | cc3m_blip2_augment_low_resource | cc3m_blip2_augment_translated_sim | cc3m_filtered_blip2_augment_translated_sim | cc3m_laclip |
|---------------------------|---------------------------------|-----------------------------------|--------------------------------------------|-------------|
| **Languages** | PT, HI, XH | EN, PT | EN, PT | PT |
| **Translation Methods** | Google + MarianMT | Google + MarianMT | Google + MarianMT | Google + MarianMT |
| **Augmentation Model** | BLIP2 | BLIP2 | BLIP2 | LaCLIP |
| **Augmentation Count** | 5 | 5 | 5 | Variable |
| **Augmentations Translated** | HI, XH | PT | PT | PT |
| **Cosine Similarity** | ❌ | ✔️ | ✔️ | ❌ |
| **CLIP Filtering** | ❌ | ❌ | ✔️ (≥ 0.2) | ❌ |
| **Target Purpose** | Low-resource training | Multilingual augmentation + similarity | High-quality filtered dataset | LaCLIP augmentation studies |
Provider:
DiegoAlysson



