
Obscure-Entropy/MangaliCa_EN-HU

Source: Hugging Face. Updated 2026-01-12. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Obscure-Entropy/MangaliCa_EN-HU
Resource description:
---
language:
- en
- hu
license: cc-by-4.0
tags:
- multimodal
- bilingual
- hungarian
- image-text
- synthetic-data
size_categories:
- 50M+
task_categories:
- image-to-text
---

# MangaliCa Bilingual Image–Caption Dataset (Hungarian–English)

## Dataset Description

The **MangaliCa Bilingual Image–Caption Dataset** is a large-scale **Hungarian–English multimodal dataset** containing approximately **70 million aligned image–caption pairs**. Each image is paired with:

- an **original English caption**
- a **machine-translated Hungarian caption**

This dataset was created to address the lack of large-scale multimodal resources for **Hungarian**, enabling bilingual vision–language training and evaluation at scale.

---

## Data Sources

The dataset aggregates and extends several large English image–caption datasets:

| Source Dataset | Approx. Samples |
|---------------|-----------------|
| DataComp-1B | ~40M |
| Conceptual Captions 12M | ~8M |
| GBC10M | ~8M |
| PixelProse | ~14M |

English captions were translated into Hungarian using a large-scale automated translation pipeline.

---

## Dataset Structure

Each sample contains:

- `img`: RGB image (JPEG)
- `en_cap`: English caption
- `hu_cap`: Hungarian caption

Data is stored in **Parquet format**, sharded into ~1M-sample files for efficient streaming.

---

## Dataset Creation Pipeline

1. Automated English → Hungarian translation
2. Parallel image downloading
3. Image validation and filtering
4. Post-processing and shard merging
5. Upload to Hugging Face Hub

The pipeline is fully automated and distributed across parallel workers.
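
The per-sample layout and the validation step listed above can be illustrated with a minimal sketch. This is not the dataset's actual pipeline code. Only the field names `img`, `en_cap`, and `hu_cap` come from the card. The `is_valid_sample` helper, the choice of checks (JPEG start-of-image marker, non-empty captions), and the assumption that `img` is available as raw bytes are hypothetical and chosen for illustration.

```python
from typing import Any, Mapping

# JPEG streams begin with the start-of-image marker FF D8, followed by FF.
JPEG_MAGIC = b"\xff\xd8\xff"

# Field names as documented in the dataset card.
REQUIRED_FIELDS = ("img", "en_cap", "hu_cap")


def is_valid_sample(sample: Mapping[str, Any]) -> bool:
    """Hypothetical validity check for one image-caption record.

    Assumes `img` holds raw JPEG bytes; a loader that decodes images
    into objects would need a different check.
    """
    # Every documented field must be present.
    if any(field not in sample for field in REQUIRED_FIELDS):
        return False

    # The image payload must be bytes that start like a JPEG file.
    img = sample["img"]
    if not isinstance(img, (bytes, bytearray)):
        return False
    if not bytes(img).startswith(JPEG_MAGIC):
        return False

    # Both captions must be non-empty strings after trimming whitespace.
    return all(
        isinstance(sample[key], str) and bool(sample[key].strip())
        for key in ("en_cap", "hu_cap")
    )


if __name__ == "__main__":
    record = {
        "img": JPEG_MAGIC + b"\xe0",  # truncated stand-in for a real JPEG
        "en_cap": "A dog on a beach",
        "hu_cap": "Egy kutya a tengerparton",
    }
    print(is_valid_sample(record))
```

A check of this kind only inspects the file header, so it would not catch truncated or corrupt image bodies; a real filter would likely also attempt a full decode.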

---

## Intended Uses

### Supported Tasks

- Vision–language pretraining
- Bilingual image captioning
- Image–text retrieval
- Cross-lingual multimodal research

### Not Recommended For

- Human annotation studies
- Fine-grained linguistic evaluation without filtering
- Safety-critical applications

---

## Known Limitations

- Hungarian captions are **machine-translated** and may contain:
  - Minor grammatical errors
  - Literal translations
- Images are web-sourced and may reflect dataset biases
- Some domains (e.g., web alt-text) are noisier than curated datasets

---

## Ethical Considerations

- Dataset is constructed from publicly available web data
- No explicit demographic balancing was performed
- Users should be cautious of cultural or societal biases present in web-scale data

---
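
The card advises against fine-grained linguistic evaluation without filtering, but it does not describe a filter. One simple heuristic, shown here as an assumption and not as the authors' method, is to drop caption pairs whose Hungarian-to-English length ratio is implausible, or whose Hungarian text is identical to the English text (which may indicate an untranslated caption). The `keep_pair` name and both thresholds are hypothetical and would need tuning on real data.

```python
def keep_pair(
    en_cap: str,
    hu_cap: str,
    min_ratio: float = 0.5,  # hypothetical lower bound on len(hu) / len(en)
    max_ratio: float = 2.0,  # hypothetical upper bound on len(hu) / len(en)
) -> bool:
    """Heuristic quality filter for one English-Hungarian caption pair."""
    en = en_cap.strip()
    hu = hu_cap.strip()

    # Reject pairs with a missing caption on either side.
    if not en or not hu:
        return False

    # Identical text suggests the caption was not translated. Note that
    # this also rejects captions that are legitimately the same in both
    # languages, such as a bare proper name.
    if en.casefold() == hu.casefold():
        return False

    # Reject pairs whose character-length ratio falls outside the bounds.
    ratio = len(hu) / len(en)
    return min_ratio <= ratio <= max_ratio


if __name__ == "__main__":
    pairs = [
        ("A dog on a beach", "Egy kutya a tengerparton"),
        ("A dog", "A dog"),
    ]
    kept = [pair for pair in pairs if keep_pair(*pair)]
    print(len(kept))
```

A length-ratio filter is cheap enough to run over tens of millions of pairs, but it says nothing about translation adequacy; it only removes the most obvious failures.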
Provider:
Obscure-Entropy