Ajax102/OmniCap-400M
Source: Hugging Face | Updated: 2026-01-13 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/Ajax102/OmniCap-400M
Description:
---
language:
- en
- zh
license: mit
multilinguality: multilingual
size_categories:
- 100M<n<1B
task_categories:
- image-to-text
- text-to-image
tags:
- image-text-pairs
- large-scale
- multimodal
- captioning
- retrieval
annotations_creators:
- machine-generated
source_datasets:
- web-scraped
---
# OmniCap-400M
**OmniCap-400M** is a large-scale, general-purpose image-text dataset containing **400 million** diverse image-caption pairs collected from the open web. It is designed to support a wide range of multimodal research tasks, including vision-language pretraining, image captioning, cross-modal retrieval, and text-to-image generation.
Each entry includes rich metadata to facilitate filtering, deduplication, and analysis.
## Dataset Structure
The dataset is stored in Apache Parquet format and contains the following fields:
| Field | Type | Description |
|-------------------|--------|-------------|
| `url` | string | The source URL of the image. |
| `md5` | string | MD5 hash of the image URL (for deduplication). |
| `width` | int32 | Width of the image in pixels. |
| `height` | int32 | Height of the image in pixels. |
| `blip_caption` | string | Machine-generated caption using [BLIP](https://huggingface.co/Salesforce/blip-image-captioning-base). |
| `caption` | string | Raw textual context associated with the image (e.g., alt text, surrounding HTML text). |
| `query` | string | Keywords for web search. |
> ⚠️ **Note**: This dataset contains web-crawled data. Users are responsible for complying with the terms of use of the source websites and applicable laws.
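As a rough illustration of how the `width`, `height`, and `md5` fields might be used for filtering and deduplication, here is a minimal sketch in plain Python. The helper names `url_md5` and `filter_and_dedup`, the `min_side` threshold of 256 pixels, and the toy rows are all hypothetical. It also assumes `md5` is the hex digest of the UTF-8 encoded URL, which the card does not confirm.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def url_md5(url: str) -> str:
    """Hex MD5 of a URL string (assumption: UTF-8 encoding, hex digest)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def filter_and_dedup(rows: Iterable[Dict], min_side: int = 256) -> Iterator[Dict]:
    """Yield rows whose shorter side is >= min_side, keeping the first row per md5."""
    seen = set()
    for row in rows:
        # Drop images that are too small on either axis.
        if min(row["width"], row["height"]) < min_side:
            continue
        # Drop rows whose URL hash has already been emitted.
        if row["md5"] in seen:
            continue
        seen.add(row["md5"])
        yield row


# Toy rows mirroring the documented schema (values are made up for illustration).
rows = [
    {"url": "https://example.com/a.jpg", "width": 640, "height": 480,
     "blip_caption": "a dog on grass", "caption": "My dog", "query": "dog"},
    {"url": "https://example.com/a.jpg", "width": 640, "height": 480,
     "blip_caption": "a dog on grass", "caption": "Dog photo", "query": "puppy"},
    {"url": "https://example.com/b.jpg", "width": 120, "height": 90,
     "blip_caption": "a small icon", "caption": "", "query": "icon"},
]
for r in rows:
    r["md5"] = url_md5(r["url"])

kept = list(filter_and_dedup(rows))
print(len(kept), kept[0]["caption"])
```

The same predicate could be applied per shard when reading the Parquet files. Note that deduplication across shards would require keeping the `seen` set in memory or in an external store, which may be costly at this scale.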
## Intended Use
- Pretraining or fine-tuning multimodal models (e.g., CLIP, BLIP, LLaVA).
- Training text-to-image diffusion models with improved caption quality.
- Building cross-modal search systems.
- Studying bias, safety, and robustness in large-scale vision-language data.
## License
MIT License.
## Citation
If you use this dataset in your research, please cite it as:
```bibtex
@dataset{ajax2026omnicap,
author = {Ajax102},
title = {OmniCap-400M: A Large-Scale General-Purpose Image-Text Dataset},
year = {2026},
publisher = {Hugging Face},
journal = {Hugging Face Datasets},
url = {https://huggingface.co/datasets/Ajax102/OmniCap-400M}
}
```
Provider:
Ajax102



