
DiegoAlysson/Translated_Expanded_CC3M-Brazilian_Portuguese-Hindi-Xhosa

Source: Hugging Face. Updated: 2025-12-05. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/DiegoAlysson/Translated_Expanded_CC3M-Brazilian_Portuguese-Hindi-Xhosa
Description:
---
license: mit
---

# CC3M Multilingual & Augmented Variants

This repository provides four multilingual, augmented, and similarity-enhanced variants of the **Conceptual Captions 3M (CC3M)** dataset. The goal is to support research in vision–language modeling, multimodal alignment, data augmentation, and low-resource language evaluation.

All versions include translations generated with **Google Translate** and **MarianMT**, and caption augmentations produced with **BLIP2**, generating **five additional captions per image**. Some versions also include similarity scores and CLIP-based filtering.

---

## Dataset Versions

### **Translated_Expanded_CC3M-Brazilian_Portuguese-Hindi-Xhosa.csv**

### **1. `cc3m_blip2_augment_low_resource`**

A version designed for *low-resource* languages.

- CC3M translated into **Portuguese, Hindi, and Xhosa**
- BLIP2 augmentations translated into **Hindi and Xhosa**
- Includes 1 original caption + 5 augmented captions
- Useful for cross-lingual and low-resource multimodal training

---

### **2. `cc3m_blip2_augment_translated_sim`**

A multilingual, augmented version with similarity metadata.

- CC3M translated into **English and Portuguese**
- 5 BLIP2 augmentations per image
- Cosine similarity scores for:
  - image × original caption
  - image × augmented captions
- Supports multimodal alignment evaluation and curriculum learning

---

### **3. `cc3m_filtered_blip2_augment_translated_sim`**

A quality-filtered version of the dataset.

- Translated into **English and Portuguese**
- Filtered with **CLIP Score ≥ 0.2**
- Includes BLIP2 augmentations and cosine similarity values
- Higher-precision image–text pairs for more robust training

---

### **4. `cc3m_laclip`**

A dataset augmented using **LaCLIP** instead of BLIP2.

- Augmentation exclusively with LaCLIP
- Focused on **Portuguese**
- Includes original caption + LaCLIP-generated captions
- Ideal for studies involving LaCLIP-based captioning

---

### **Translated_Expanded_CC3M-Brazilian_Portuguese-Validation.csv**

### **2. `cc3m_val`**

A multilingual CC3M validation set.

- CC3M validation captions translated into **English and Portuguese**

---

## Methodology

### **Translations**

All captions were translated using:

- **Google Translate API**
- **MarianMT (Helsinki-NLP)**

This dual-translation setup supports comparative linguistic analysis.

### **Caption Augmentation**

- **BLIP2** generated *five new captions per image*
- **LaCLIP** used in one version (`cc3m_laclip`)
- Augmentations were translated into target languages where relevant (e.g., Hindi, Xhosa)

### **Filtering**

Only the version `cc3m_filtered_blip2_augment_translated_sim` applies filtering:

- **CLIP Score ≥ 0.2**
- Helps remove noisy or mismatched image–caption pairs

### **Similarity Scores**

Some versions provide cosine similarity values for:

- image × original caption
- image × augmented captions

These metrics are useful for:

- data quality control
- sample reweighting
- multimodal consistency analysis

---

## Comparison Table

| Feature / Dataset Version | cc3m_blip2_augment_low_resource | cc3m_blip2_augment_translated_sim | cc3m_filtered_blip2_augment_translated_sim | cc3m_laclip |
|---------------------------|---------------------------------|-----------------------------------|--------------------------------------------|-------------|
| **Languages** | PT, HI, XH | EN, PT | EN, PT | PT |
| **Translation Methods** | Google + MarianMT | Google + MarianMT | Google + MarianMT | Google + MarianMT |
| **Augmentation Model** | BLIP2 | BLIP2 | BLIP2 | LaCLIP |
| **Augmentation Count** | 5 | 5 | 5 | Variable |
| **Augmentations Translated** | HI, XH | PT | PT | PT |
| **Cosine Similarity** | ❌ | ✔️ | ✔️ | ❌ |
| **CLIP Filtering** | ❌ | ❌ | ✔️ (≥ 0.2) | ❌ |
| **Target Purpose** | Low-resource training | Multilingual augmentation + similarity | High-quality filtered dataset | LaCLIP augmentation studies |
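
The filtering and similarity-score steps described in the Methodology section can be illustrated with a small, self-contained sketch. It is not the pipeline used to build this dataset. It assumes that "CLIP Score" means the raw cosine similarity between an image embedding and a caption embedding, and it uses toy 2-D vectors in place of real CLIP embeddings. The field names (`image_emb`, `text_emb`, `caption`) and the helper functions are hypothetical and do not correspond to columns in the CSV files.

```python
import math

# Threshold stated in the dataset card for the filtered variant.
CLIP_THRESHOLD = 0.2


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(rows, threshold=CLIP_THRESHOLD):
    """Keep image-caption pairs whose similarity is at least `threshold`."""
    return [
        r for r in rows
        if cosine_similarity(r["image_emb"], r["text_emb"]) >= threshold
    ]


def similarity_weights(rows):
    """Turn similarities into sampling weights that sum to 1 (one way to reweight samples)."""
    sims = [max(cosine_similarity(r["image_emb"], r["text_emb"]), 0.0) for r in rows]
    total = sum(sims)
    if total == 0.0:
        # Fall back to uniform weights when no pair has positive similarity.
        return [1.0 / len(rows)] * len(rows) if rows else []
    return [s / total for s in sims]


# Toy data: hypothetical rows with made-up 2-D "embeddings".
rows = [
    {"caption": "aligned caption", "image_emb": [1.0, 0.0], "text_emb": [1.0, 0.0]},
    {"caption": "unrelated caption", "image_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},
    {"caption": "partially aligned", "image_emb": [1.0, 0.0], "text_emb": [1.0, 1.0]},
]

kept = filter_pairs(rows)
weights = similarity_weights(kept)

for row, weight in zip(kept, weights):
    print(f"{row['caption']}: weight={weight:.3f}")
```

With real data, the precomputed cosine similarity values shipped in the `*_sim` variants could stand in for the `cosine_similarity` call, so no embeddings would need to be recomputed. The exact column names should be checked against the CSV headers.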
Provided by:
DiegoAlysson