Multimodal C4 (mmc4)

Name: Multimodal C4 (mmc4)
Creator: University of California, Santa Barbara
Published: 2023-10-28 12:19:41
License: No description available

Source: arXiv. Updated: 2023-10-28. Indexed: 2024-06-21.

Download link:

https://github.com/allenai/mmc4

Overview:

Multimodal C4 (mmc4) is an open, billion-scale dataset of images interleaved with text, developed by institutions including the University of California, Santa Barbara. It augments the text-only c4 corpus by placing images into each document's text with a linear assignment algorithm that operates on CLIP features. The documents cover everyday topics such as cooking, travel, and technology. After inappropriate content is filtered out, the dataset contains 101.2 million documents in which 571 million images are interleaved with 43 billion English words. mmc4 is intended to support the training of vision-language models that must handle diverse interactions between images and text, including few-shot learning, and to advance multimodal language technology more broadly.
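The linear assignment step mentioned above can be illustrated with a minimal sketch. This is not the mmc4 authors' code: the function names `cosine` and `assign_images` are hypothetical, the feature vectors are toy values standing in for real CLIP embeddings, and the constraint that each sentence receives at most one image is an assumption made for the illustration. The sketch solves the assignment by exhaustive search, which is only practical for tiny inputs; a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice at scale.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def assign_images(image_feats, sentence_feats):
    """Return {image_index: sentence_index} maximizing total similarity.

    Assumption for this sketch: each sentence gets at most one image,
    so there must be at least as many sentences as images.
    """
    n_img, n_sent = len(image_feats), len(sentence_feats)
    if n_img > n_sent:
        raise ValueError("need at least as many sentences as images")
    # Similarity matrix: rows are images, columns are sentences.
    sim = [[cosine(img, sent) for sent in sentence_feats] for img in image_feats]
    # Exhaustive search over every one-to-one placement (tiny inputs only).
    best = max(
        permutations(range(n_sent), n_img),
        key=lambda p: sum(sim[i][j] for i, j in enumerate(p)),
    )
    return dict(enumerate(best))


# Toy vectors standing in for CLIP image and sentence embeddings.
image_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sentence_feats = [[0.0, 1.0, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

assignment = assign_images(image_feats, sentence_feats)
print(assignment)  # {0: 1, 1: 0}
```

Here image 0 is placed at sentence 1 and image 1 at sentence 0, since that pairing gives the highest total similarity; sentence 2 receives no image.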

Provided by:

University of California, Santa Barbara

Created:

2023-04-14

Collected Summary

Dataset Introduction