Arabic-Image-Captioning_100M
ModelScope Community · Updated 2026-01-02 · Indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/Misraj/Arabic-Image-Captioning_100M
# Arabic Image Captioning Dataset (100M Sample)
**The first large-scale Arabic multimodal dataset.**
This dataset contains 100 million Arabic image captions, making it the first comprehensive Arabic multimodal resource of this scale and quality. The captions were produced by translating a sample of UCSC-VLAA/Recap-DataComp-1B with our Mutarjim translation model. The dataset addresses a critical gap in Arabic multimodal AI resources and lets researchers develop Arabic vision-language systems at a scale that was not previously possible.
## Dataset Description
**Size**: 100 million image-caption pairs
**Language**: Arabic
**Total Words**: Approximately 6 billion Arabic words
**Source**: Translated sample from UCSC-VLAA/Recap-DataComp-1B
**Translation Model**: Mutarjim 1.5B parameter Arabic-English translation model
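
As a quick sanity check on the figures above, the stated totals imply an average caption length of about 60 Arabic words. The sketch below only restates that arithmetic; both constants are the approximate values quoted in this card, not measured counts.

```python
# Approximate figures quoted in the Dataset Description (not measured values).
TOTAL_WORDS = 6_000_000_000  # "approximately 6 billion Arabic words"
NUM_PAIRS = 100_000_000      # "100 million image-caption pairs"


def average_words_per_caption(total_words: int, num_pairs: int) -> float:
    """Return the mean number of words per caption implied by the totals."""
    return total_words / num_pairs


print(average_words_per_caption(TOTAL_WORDS, NUM_PAIRS))  # 60.0
```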
## Key Features
- **First of Its Kind**: The first large-scale Arabic multimodal dataset, filling a critical gap in Arabic AI research
- **Unprecedented Scale**: 100 million Arabic image captions - the largest Arabic multimodal dataset available
- **Superior Translation Quality**: All captions translated using Mutarjim, which outperforms models up to 20× larger on Arabic-English translation benchmarks
- **Breakthrough for Arabic AI**: Enables development of Arabic vision-language models
- **Research-Ready Format**: Structured for immediate use in multimodal research and Arabic NLP tasks
## Impact & Significance
This dataset:
- **Eliminates a Critical Bottleneck**: Removes the primary obstacle that has hindered Arabic multimodal AI development
- **Enables New Research Directions**: Opens entirely new avenues for Arabic AI research previously impossible due to data limitations
## Data Quality
- **Expert Translation**: All captions were generated by Mutarjim, a model trained with an optimized two-phase pipeline
## Technical Specifications
**Format**:
**Fields**:
- `url`: Unique identifier for the source image
- `Arabic_Translation`: High-quality Arabic translation of the original caption
- `Original_Text`: Original English caption (if included)
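
The fields above can be read with the Hugging Face `datasets` library. The following is a minimal sketch, not an official loader: the repository ID comes from the "Full Dataset" link in this card, while the split name `"train"` and the helper names `to_pair` and `stream_pairs` are assumptions made for illustration. Streaming mode avoids downloading all 100 million records up front.

```python
from typing import Dict, Iterator, Optional

# Repository ID taken from the "Full Dataset" link in this card.
REPO_ID = "Misraj/Arabic-Image-Captioning_100M"


def to_pair(record: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Map one raw record to a compact image/caption pair.

    Uses the three documented fields. `Original_Text` is treated as
    optional because the card lists it as "if included".
    """
    return {
        "image_url": record["url"],
        "caption_ar": (record["Arabic_Translation"] or "").strip(),
        "caption_en": record.get("Original_Text"),
    }


def stream_pairs(limit: int = 3, split: str = "train") -> Iterator[Dict[str, Optional[str]]]:
    """Yield the first `limit` records in streaming mode.

    Assumption: the repository exposes a split named "train".
    """
    from datasets import load_dataset  # pip install datasets

    rows = load_dataset(REPO_ID, split=split, streaming=True)
    for i, record in enumerate(rows):
        if i >= limit:
            break
        yield to_pair(record)


if __name__ == "__main__":
    # Offline demonstration on a hand-written record (hypothetical values).
    sample = {
        "url": "https://example.com/cat.jpg",
        "Arabic_Translation": " قطة على أريكة ",
        "Original_Text": "A cat on a sofa",
    }
    print(to_pair(sample))
```

Calling `stream_pairs()` requires network access and the `datasets` package. The dataset stores image URLs rather than image bytes, so fetching the images themselves is a separate step.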
## Citation
If you use this dataset in your research, please cite:
```bibtex
@misc{hennara2025mutarjimadvancingbidirectionalarabicenglish,
title={Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model},
author={Khalil Hennara and Muhammad Hreden and Mohamed Motaism Hamed and Zeina Aldallal and Sara Chrouf and Safwan AlModhayan},
year={2025},
eprint={2505.17894},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.17894},
}
@article{li2024recaption,
title={What If We Recaption Billions of Web Images with LLaMA-3?},
author={Xianhang Li and Haoqin Tu and Mude Hui and Zeyu Wang and Bingchen Zhao and Junfei Xiao and Sucheng Ren and Jieru Mei and Qing Liu and Huangjie Zheng and Yuyin Zhou and Cihang Xie},
journal={arXiv preprint arXiv:2406.08478},
year={2024}
}
```
## Related Resources
- **Tarjama-25 Benchmark**: https://huggingface.co/datasets/Misraj/Tarjama-25
- **Technical Paper**: https://www.arxiv.org/abs/2505.17894
- **Full Dataset**: https://huggingface.co/datasets/Misraj/Arabic-Image-Captioning_100M
## Contact
Contact us to collaborate or integrate Mutarjim into your workflow!
---
*This dataset represents a significant contribution to Arabic multimodal AI research and low-resource language support. We encourage researchers and developers to use this resource to advance Arabic NLP and multimodal understanding capabilities.*
Provider: maas
Created: 2025-07-07



