Multimodal-C4 (mmc4)

OpenDataLab | Updated 2026-04-05 | Indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/Multimodal-C4_mmc4

Resource description:

Multimodal C4 (MMC4) is an augmentation of the popular text-only C4 corpus in which images are interleaved with the text. The corpus contains 103 million documents, which together hold 585 million images interleaved with 43 billion English tokens.
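To put the headline counts in perspective, the short sketch below derives rough per-document averages from the three figures quoted in the description. The constants are the rounded totals given above, and the helper name `per_document_average` is made up for illustration, so the results are only approximate means and say nothing about how images and tokens are actually distributed across documents.

```python
# Corpus-level totals as quoted (rounded) in the resource description.
NUM_DOCUMENTS = 103_000_000
NUM_IMAGES = 585_000_000
NUM_TOKENS = 43_000_000_000


def per_document_average(total: int, num_documents: int) -> float:
    """Return the mean count per document (hypothetical helper)."""
    return total / num_documents


images_per_doc = per_document_average(NUM_IMAGES, NUM_DOCUMENTS)
tokens_per_doc = per_document_average(NUM_TOKENS, NUM_DOCUMENTS)
tokens_per_image = NUM_TOKENS / NUM_IMAGES

print(f"{images_per_doc:.1f} images per document")    # about 5.7
print(f"{tokens_per_doc:.0f} tokens per document")    # about 417
print(f"{tokens_per_image:.1f} tokens per image")     # about 73.5
```

On these rounded totals, an average document carries between five and six images and a few hundred English tokens.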

Provider:

OpenDataLab

Created:

2023-05-09

AI-collected summary

Dataset introduction