Obscure-Entropy/MangaliCa_EN-HU

Name: Obscure-Entropy/MangaliCa_EN-HU
Creator: Obscure-Entropy
Published: 2026-01-12 09:47:20
License: cc-by-4.0 (as declared in the dataset card metadata)

Source: Hugging Face. Updated: 2026-01-12. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Obscure-Entropy/MangaliCa_EN-HU


Dataset card:

---
language:
- en
- hu
license: cc-by-4.0
tags:
- multimodal
- bilingual
- hungarian
- image-text
- synthetic-data
size_categories:
- 50M+
task_categories:
- image-to-text
---

# MangaliCa Bilingual Image–Caption Dataset (Hungarian–English)

## Dataset Description

The **MangaliCa Bilingual Image–Caption Dataset** is a large-scale **Hungarian–English multimodal dataset** containing approximately **70 million aligned image–caption pairs**.

Each image is paired with:

- an **original English caption**
- a **machine-translated Hungarian caption**

This dataset was created to address the lack of large-scale multimodal resources for **Hungarian**, enabling bilingual vision–language training and evaluation at scale.

---

## Data Sources

The dataset aggregates and extends several large English image–caption datasets:

| Source Dataset | Approx. Samples |
|----------------|-----------------|
| DataComp-1B | ~40M |
| Conceptual Captions 12M | ~8M |
| GBC10M | ~8M |
| PixelProse | ~14M |

English captions were translated into Hungarian using a large-scale automated translation pipeline.

---

## Dataset Structure

Each sample contains:

- `img`: RGB image (JPEG)
- `en_cap`: English caption
- `hu_cap`: Hungarian caption

Data is stored in **Parquet format**, sharded into ~1M-sample files for efficient streaming.

---

## Dataset Creation Pipeline

1. Automated English → Hungarian translation
2. Parallel image downloading
3. Image validation and filtering
4. Post-processing and shard merging
5. Upload to Hugging Face Hub

The pipeline is fully automated and distributed across parallel workers.
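
The three-field sample layout described above can be checked with a small helper before training. The following is a minimal sketch, not the validation used by the original pipeline, whose criteria the card does not document. The function names `is_valid_sample` and `stream_samples`, the JPEG magic-byte check, and the split name `"train"` are all assumptions made for illustration; how `img` is decoded on load (raw bytes or an image object) is also not stated in the card, so the sketch accepts both.

```python
from typing import Any, Mapping

# Field names are taken from the dataset card.
REQUIRED_FIELDS = ("img", "en_cap", "hu_cap")
# Standard JPEG files begin with these three bytes.
JPEG_MAGIC = b"\xff\xd8\xff"


def is_valid_sample(sample: Mapping[str, Any]) -> bool:
    """Hypothetical check: all fields present, captions non-empty, image plausible."""
    if any(name not in sample for name in REQUIRED_FIELDS):
        return False
    captions_ok = all(
        isinstance(sample[name], str) and sample[name].strip() != ""
        for name in ("en_cap", "hu_cap")
    )
    img = sample["img"]
    if isinstance(img, (bytes, bytearray)):
        # Raw bytes: verify the JPEG signature only (no full decode).
        image_ok = bytes(img).startswith(JPEG_MAGIC)
    else:
        # Already-decoded image objects are accepted as long as they exist.
        image_ok = img is not None
    return captions_ok and image_ok


def stream_samples(repo_id: str = "Obscure-Entropy/MangaliCa_EN-HU",
                   split: str = "train"):
    """Stream the Parquet shards without downloading them all.

    Assumption: the split is named "train"; check the Hub page if this fails.
    Requires the third-party `datasets` package, imported lazily here.
    """
    from datasets import load_dataset
    return load_dataset(repo_id, split=split, streaming=True)
```

A typical use would be `valid = (s for s in stream_samples() if is_valid_sample(s))`, which keeps memory use flat because samples are pulled from the shards one at a time.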
---

## Intended Uses

### Supported Tasks

- Vision–language pretraining
- Bilingual image captioning
- Image–text retrieval
- Cross-lingual multimodal research

### Not Recommended For

- Human annotation studies
- Fine-grained linguistic evaluation without filtering
- Safety-critical applications

---

## Known Limitations

- Hungarian captions are **machine-translated** and may contain:
  - Minor grammatical errors
  - Literal translations
- Images are web-sourced and may reflect dataset biases
- Some domains (e.g., web alt-text) are noisier than curated datasets

---

## Ethical Considerations

- Dataset is constructed from publicly available web data
- No explicit demographic balancing was performed
- Users should be cautious of cultural or societal biases present in web-scale data

---
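
Since the card advises against fine-grained linguistic evaluation without filtering, a cheap first pass is to flag caption pairs whose translation looks suspect. The sketch below is one possible heuristic, not something the dataset authors describe: the function name `looks_suspicious` and the length-ratio thresholds are hypothetical and should be tuned on a manually inspected sample.

```python
def looks_suspicious(en_cap: str, hu_cap: str,
                     min_ratio: float = 0.5, max_ratio: float = 2.0) -> bool:
    """Flag pairs that are empty, apparently untranslated, or badly mismatched in length.

    The thresholds are illustrative assumptions, not values from the dataset card.
    """
    en, hu = en_cap.strip(), hu_cap.strip()
    if not en or not hu:
        return True
    if en.lower() == hu.lower():
        # Identical text suggests the caption was left untranslated.
        # Note: short captions made only of proper nouns will also land here.
        return True
    ratio = len(hu) / len(en)
    return not (min_ratio <= ratio <= max_ratio)
```

For example, `looks_suspicious("A dog on the beach", "Egy kutya a tengerparton")` is not flagged, while a pair whose Hungarian side is a verbatim copy of the English caption is. Flagged pairs are better reviewed than silently dropped, because the heuristic cannot judge translation quality itself.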

Provider:

Obscure-Entropy
