Obscure-Entropy/MangaliCa_EN-HU
Source: Hugging Face · Updated: 2026-01-12 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/Obscure-Entropy/MangaliCa_EN-HU
Resource description:
---
language:
- en
- hu
license: cc-by-4.0
tags:
- multimodal
- bilingual
- hungarian
- image-text
- synthetic-data
size_categories:
- 10M<n<100M
task_categories:
- image-to-text
---
# MangaliCa Bilingual Image–Caption Dataset (Hungarian–English)
## Dataset Description
The **MangaliCa Bilingual Image–Caption Dataset** is a large-scale **Hungarian–English multimodal dataset** containing approximately **70 million aligned image–caption pairs**.
Each image is paired with:
- an **original English caption**
- a **machine-translated Hungarian caption**
This dataset was created to address the lack of large-scale multimodal resources for **Hungarian**, enabling bilingual vision–language training and evaluation at scale.
---
## Data Sources
The dataset aggregates and extends several large English image–caption datasets:
| Source Dataset | Approx. Samples |
|---------------|-----------------|
| DataComp-1B | ~40M |
| Conceptual Captions 12M | ~8M |
| GBC10M | ~8M |
| PixelProse | ~14M |
English captions were translated into Hungarian using a large-scale automated translation pipeline.
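As a quick sanity check, the approximate per-source counts in the table can be summed and turned into shares of the whole. The sketch below uses only the rounded figures from the table, so the resulting percentages are rough; the variable and function names are illustrative.

```python
# Approximate per-source sample counts in millions, copied from the table above.
SOURCE_COUNTS_M = {
    "DataComp-1B": 40,
    "Conceptual Captions 12M": 8,
    "GBC10M": 8,
    "PixelProse": 14,
}


def source_shares(counts):
    """Return each source's fraction of the total sample count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_m = sum(SOURCE_COUNTS_M.values())
shares = source_shares(SOURCE_COUNTS_M)

if __name__ == "__main__":
    print(f"total: ~{total_m}M samples")
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

The rounded counts add up to roughly 70M, which matches the overall size stated in the dataset description.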
---
## Dataset Structure
Each sample contains:
- `img`: RGB image (JPEG)
- `en_cap`: English caption
- `hu_cap`: Hungarian caption
Data is stored in **Parquet format**, sharded into ~1M-sample files for efficient streaming.
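A minimal sketch of reading the dataset in streaming mode with the Hugging Face `datasets` library, so that shards are fetched lazily rather than downloaded up front. The split name `"train"` is an assumption, as is the exact decoded type of `img` (it may arrive as a decoded image object or as raw bytes, depending on how the column is declared); the helper names are illustrative.

```python
from typing import Any, Dict, Iterator, List

REPO_ID = "Obscure-Entropy/MangaliCa_EN-HU"
EXPECTED_FIELDS = ("img", "en_cap", "hu_cap")  # field names from the list above


def stream_samples(split: str = "train") -> Iterator[Dict[str, Any]]:
    """Lazily iterate over samples; the split name is an assumption."""
    # Imported here so the helper below works without the third-party package.
    from datasets import load_dataset

    return iter(load_dataset(REPO_ID, split=split, streaming=True))


def missing_fields(sample: Dict[str, Any]) -> List[str]:
    """Return the expected field names that are absent from a sample."""
    return [name for name in EXPECTED_FIELDS if name not in sample]


if __name__ == "__main__":
    first = next(stream_samples())
    print("missing fields:", missing_fields(first))
    print("English:", first["en_cap"])
    print("Hungarian:", first["hu_cap"])
```

Checking the first record with `missing_fields` before starting a long training job is a cheap way to confirm that the schema matches what the card describes.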
---
## Dataset Creation Pipeline
1. Automated English → Hungarian translation
2. Parallel image downloading
3. Image validation and filtering
4. Post-processing and shard merging
5. Upload to Hugging Face Hub
The pipeline is fully automated and distributed across parallel workers.
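The card does not say how image validation is implemented. As an illustration only, the sketch below shows one cheap way such a step could work: check the JPEG start and end markers on the downloaded bytes and run the check across a thread pool. Both the marker heuristic and the `(image_bytes, en_cap, hu_cap)` record layout are assumptions, not the authors' method, and a real pipeline would likely also decode each image fully.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Tuple

Record = Tuple[bytes, str, str]  # (image_bytes, en_cap, hu_cap), hypothetical layout


def looks_like_jpeg(data: bytes) -> bool:
    """Heuristic check: JPEG files start with FF D8 FF and end with FF D9."""
    return len(data) >= 4 and data[:3] == b"\xff\xd8\xff" and data[-2:] == b"\xff\xd9"


def filter_valid(records: Iterable[Record], max_workers: int = 8) -> List[Record]:
    """Keep only records whose image bytes pass the marker check."""
    items = list(records)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so flags line up with items.
        flags = list(pool.map(lambda rec: looks_like_jpeg(rec[0]), items))
    return [rec for rec, ok in zip(items, flags) if ok]


if __name__ == "__main__":
    good = (b"\xff\xd8\xff\xe0" + b"\x00" * 8 + b"\xff\xd9", "a cat", "egy macska")
    bad = (b"<html>not an image</html>", "a dog", "egy kutya")
    print(len(filter_valid([good, bad])), "of 2 records kept")
```

A thread pool is used here because the check is trivial and downloads are I/O-bound; heavier validation such as full decoding would be a better fit for process-based workers.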
---
## Intended Uses
### Supported Tasks
- Vision–language pretraining
- Bilingual image captioning
- Image–text retrieval
- Cross-lingual multimodal research
### Not Recommended For
- Human annotation studies
- Fine-grained linguistic evaluation without filtering
- Safety-critical applications
---
## Known Limitations
- Hungarian captions are **machine-translated** and may contain:
  - Minor grammatical errors
  - Literal translations
- Images are web-sourced and may reflect dataset biases
- Captions from some sources (e.g., web alt-text) are noisier than those from curated datasets
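Because the Hungarian captions are machine-translated, users who need cleaner pairs may want a light filter before evaluation. The sketch below is one possible heuristic, not part of the dataset: it drops pairs with an empty side, pairs where the Hungarian text is identical to the English (likely untranslated), and pairs with an implausible length ratio. The thresholds are arbitrary assumptions, and the filter cannot detect grammatical errors or overly literal translations.

```python
def keep_pair(en_cap: str, hu_cap: str,
              min_ratio: float = 0.5, max_ratio: float = 2.0) -> bool:
    """Return True if a caption pair passes simple sanity checks."""
    en, hu = en_cap.strip(), hu_cap.strip()
    if not en or not hu:
        return False  # one side is empty
    if en.casefold() == hu.casefold():
        return False  # likely left untranslated
    # Character-length ratio; thresholds are illustrative, not tuned.
    ratio = len(hu) / len(en)
    return min_ratio <= ratio <= max_ratio


if __name__ == "__main__":
    # Made-up example pairs, not taken from the dataset.
    pairs = [
        ("A dog runs on the beach.", "Egy kutya fut a tengerparton."),
        ("A dog runs on the beach.", "A dog runs on the beach."),
        ("A dog runs on the beach.", ""),
    ]
    for en, hu in pairs:
        print(keep_pair(en, hu), "|", en, "|", hu)
```

Stricter filtering, for example language identification or scoring with a translation-quality model, would be needed for fine-grained linguistic evaluation.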
---
## Ethical Considerations
- The dataset is constructed from publicly available web data
- No explicit demographic balancing was performed
- Users should be cautious of cultural or societal biases present in web-scale data
---
Provider:
Obscure-Entropy



