
Ajax102/OmniCap-400M

Hugging Face | Updated 2026-01-13 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Ajax102/OmniCap-400M
Resource description:
---
language:
- en
- zh
license: mit
multilinguality: multilingual
size_categories:
- 100M<n<1B
task_categories:
- image-to-text
- text-to-image
tags:
- image-text-pairs
- large-scale
- multimodal
- captioning
- retrieval
annotations_creators:
- machine-generated
source_datasets:
- web-scraped
---

# OmniCap-400M

**OmniCap-400M** is a large-scale, general-purpose image-text dataset containing **400 million** diverse image-caption pairs collected from the open web. It is designed to support a wide range of multimodal research tasks, including vision-language pretraining, image captioning, cross-modal retrieval, and text-to-image generation. Each entry includes rich metadata to facilitate filtering, deduplication, and analysis.

## Dataset Structure

The dataset is stored in Apache Parquet format and contains the following fields:

| Field          | Type   | Description |
|----------------|--------|-------------|
| `url`          | string | The source URL of the image. |
| `md5`          | string | MD5 hash of the image URL (for deduplication). |
| `width`        | int32  | Width of the image in pixels. |
| `height`       | int32  | Height of the image in pixels. |
| `blip_caption` | string | Machine-generated caption using [BLIP](https://huggingface.co/Salesforce/blip-image-captioning-base). |
| `caption`      | string | Raw textual context associated with the image (e.g., alt text, surrounding HTML text). |
| `query`        | string | Keywords for web search. |

> ⚠️ **Note**: This dataset contains web-crawled data. Users are responsible for complying with the terms of use of the source websites and applicable laws.

## Intended Use

- Pretraining or fine-tuning multimodal models (e.g., CLIP, BLIP, LLaVA).
- Training text-to-image diffusion models with improved caption quality.
- Building cross-modal search systems.
- Studying bias, safety, and robustness in large-scale vision-language data.

## License

MIT License.
## Citation

If you use this dataset in your research, please cite it as:

```bibtex
@dataset{ajax2026omnicap,
  author    = {Ajax102},
  title     = {OmniCap-400M: A Large-Scale General-Purpose Image-Text Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  journal   = {Hugging Face Datasets},
  url       = {https://huggingface.co/datasets/Ajax102/OmniCap-400M}
}
```
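## Usage Sketch

The `md5`, `width`, and `height` fields listed under Dataset Structure are described as supporting deduplication and filtering. The following is a minimal sketch of how that might look, not an official loader. It works on plain Python dictionaries so that it runs without any Parquet library; in practice the rows would come from whichever Parquet reader you use. The function names `url_md5` and `dedup_and_filter`, the `min_side` threshold, and the `example.com` URLs are all hypothetical. The exact byte encoding used to produce the dataset's `md5` values is not documented here, so the UTF-8 assumption in `url_md5` may not reproduce them.

```python
import hashlib


def url_md5(url: str) -> str:
    """Hex MD5 digest of a URL string.

    Assumption: the URL is hashed as raw UTF-8 bytes with no normalization.
    The dataset card does not state how its `md5` field was computed.
    """
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def dedup_and_filter(rows, min_side=256):
    """Yield the first row seen for each md5 whose shorter side is >= min_side."""
    seen = set()
    for row in rows:
        # Prefer the stored md5 field; fall back to hashing the URL ourselves.
        key = row.get("md5") or url_md5(row["url"])
        if key in seen:
            continue  # duplicate URL hash, skip
        seen.add(key)
        if min(row["width"], row["height"]) >= min_side:
            yield row


# Hypothetical rows mimicking the documented schema (captions omitted for brevity).
sample_rows = [
    {"url": "https://example.com/a.jpg", "width": 640, "height": 480},
    {"url": "https://example.com/a.jpg", "width": 640, "height": 480},   # duplicate
    {"url": "https://example.com/b.jpg", "width": 100, "height": 100},   # too small
    {"url": "https://example.com/c.jpg", "width": 1024, "height": 768},
]
for r in sample_rows:
    r["md5"] = url_md5(r["url"])

kept = list(dedup_and_filter(sample_rows, min_side=256))
print([r["url"] for r in kept])
# prints ['https://example.com/a.jpg', 'https://example.com/c.jpg']
```

Keeping the set of seen hashes in memory is only practical for a subset; at the full 400 million rows, a sharded or on-disk approach would likely be needed.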
Provider:
Ajax102