OmniCorpus-CC-210M

Name: OmniCorpus-CC-210M
Creator: maas
Published: 2025-12-04 16:17:42
License: No description available

ModelScope community · Updated 2025-12-04 · Indexed 2024-10-26

Download link:

https://modelscope.cn/datasets/OpenGVLab/OmniCorpus-CC-210M


Resource description:

<p align="center">
  <h1 align="center">🐳 OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text</h1>
</p>

This repository contains 210 million image-text interleaved documents filtered from the [OmniCorpus-CC](https://huggingface.co/datasets/OpenGVLab/OmniCorpus-CC) dataset, which was sourced from [Common Crawl](https://commoncrawl.org/).

- Repository: https://github.com/OpenGVLab/OmniCorpus
- Paper (ICLR 2025 Spotlight): https://arxiv.org/abs/2406.08418

OmniCorpus is a large-scale image-text interleaved dataset that pushes the boundaries of scale and diversity by encompassing **8.6 billion images** interleaved with **1,696 billion text tokens** from diverse sources, significantly surpassing previous datasets. It offers several advantages over its counterparts:

1. **Larger data scale:** Our dataset is 1.7 times larger in images and 12.5 times larger in texts than the previously largest multimodal dataset, LAION-5B, while maintaining excellent data quality.
2. **Richer data diversity:** Drawing from a broader range of data sources, our dataset is more diverse than other image-text interleaved datasets. It includes bilingual multimodal data in both Chinese and English, and encompasses text-centric and vision-centric documents extracted from common websites and video platforms.
3. **More flexible format:** The streaming data format of our dataset offers exceptional flexibility, allowing adaptation to various data structures, including pure text corpora, image-text pairs, and interleaved data formats.

<img width="578" alt="image" src="https://github.com/OpenGVLab/OmniCorpus/assets/47669167/641a6427-ba50-41e6-8634-8810113fd803">

OmniCorpus contains three sections:

- **OmniCorpus-CC**: processed from Common Crawl dumps from 2013 to Nov./Dec. 2023.
- **OmniCorpus-CW**: sourced from Chinese internet resources; it will be available on the [OpenDataLab](https://opendatalab.com/) platform.
- **OmniCorpus-YT**: samples YouTube video frames as images and collects subtitles as texts.

Code for pre-training, evaluation, main body extraction, and filtering has been released in the official [repository](https://github.com/OpenGVLab/OmniCorpus). A pre-trained model is available [here](https://huggingface.co/Qingyun/OmniCorpus-InternVL).

### Update (2024-08-30):

We release the natural arrangement version of the OmniCorpus-CC documents, now available in the `data` folder.

Coming soon:

- Shuffled Parquet Shards: The same document content in a shuffled format.
- Documents with Similarities: Documents split at the sentence level, resulting in minor differences in text content.

# Data Pipeline

Our data pipeline consists of five key stages: main body extraction, preliminary text filtering, document deduplication, image downloading & filtering, and detailed text filtering. Each stage reduces the dataset so that only high-quality data is retained. Please refer to our paper for more details about the data pipeline.

<img width="723" alt="image" src="https://github.com/OpenGVLab/OmniCorpus/assets/47669167/a6de8928-58fb-4ff4-8ef9-4bd90e9ada5f">

# Usages

The image-text interleaved documents are recommended for the following usages:

- Pre-training multimodal large language models (MLLMs): Recent MLLMs (such as the Flamingo series, EMU series, IDEFICS series, MM1, Cambrian-1, and xGen-MM) have shown that image-text interleaved data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning.
- Long text-image retrieval: We provide image-text similarities calculated with CLIP, which can be used to convert the documents into an image-text retrieval dataset with longer text. A retrieval model pre-trained on such data can retrieve images based on longer text, which is useful for multimodal RAG, converting pure text into multimodal samples, etc.
- Source for further dataset research: Our data is large-scale and can serve as a source for research on data curation strategies. We provide many useful attributes as metadata for each document, which can enrich the filtering strategy and reduce the cost.
- ......

# Data Format

Following common practice, the data is organized in the Parquet file format.
You might encounter errors when using `pandas.read_parquet` (because the data structure contains nested elements). We recommend using fastparquet to load the Parquet files.

```Python
import fastparquet
import pyarrow.parquet as pq

# Load a whole shard into a pandas DataFrame with fastparquet.
df = fastparquet.ParquetFile(parquet_file_path).to_pandas()

# Alternatively, stream the shard in record batches with pyarrow.
parquet_file = pq.ParquetFile(parquet_file_path)
for batch in parquet_file.iter_batches():
    df = batch.to_pandas()
```

You can select the i-th document and convert it into a dictionary.

```Python
doc_dict = df.iloc[i].to_dict()
```

The document format is as follows:

```json
{
    "images": [
        <str: image_1_url>,
        None,
        <str: image_2_url>,
        None
    ],
    "texts": [
        None,
        <str: text_paragraph_1_content>,
        None,
        <str: text_paragraph_2_content>
    ],
    "metadata": [
        <dict: image_1_metadata>,
        None,
        <dict: image_2_metadata>,
        None
    ],
    "general_metadata": {
        "url": <str: document url>,
        "id": <str: document id>,
        "domain": <list[str]: domains extracted from document url>,
        "fluency_prob": <float: the probability of fluency>,
        "non_advertisement_prob": <float: the probability of non-advertisement>,
        "porn_prob": <float: the probability of porn content>,
        "politics_prob": <float: the probability of politics content>,
        "toxic_prob": <float: the probability of toxic content>
    }
}
```

The metadata of each image is as follows:

```json
{
    "img_url_sha": <str: sha code of image url>,
    "width": <int: image width>,
    "height": <int: image height>,
    "bytes": <int: byte number of the image file>,
    "d_hash": <str: d_hash code of the image, used for image deduplication>,
    "p_hash": <str: p_hash code of the image, used for image deduplication>,
    "d_hash_dup_count": <int: duplicated times detected by d_hash code>,
    "p_hash_dup_count": <int: duplicated times detected by p_hash code>,
    "aesthetic prob": <float: aesthetic probability>,
    "unsafe prob": <float: NSFW probability>
}
```

# License and Terms of Use

The OmniCorpus dataset is distributed under [the CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/). The open-source code is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).

The Terms of Use (ToUs) have been developed based on widely accepted standards. By accessing or using this dataset, users acknowledge their responsibility to comply with all relevant legal, regulatory, and ethical standards.

- All users, whether from academia or industry, must comply with the ToUs outlined in the CC BY 4.0 License.
- Any derived datasets or models must acknowledge the use of the OmniCorpus dataset to maintain transparency.
- The OmniCorpus must not be used in any project involving sensitive content or harmful outcomes, including but not limited to political manipulation, hate speech generation, misinformation propagation, or tasks that perpetuate harmful stereotypes or biases.
- The use of this dataset in any manner that violates rights, such as copyright infringement, privacy breaches, or misuse of sensitive information, is strictly prohibited.
- While we do not enforce jurisdiction-specific terms, we strongly recommend that users ensure compliance with applicable local laws and regulations.
- The use of a specific subset must comply with the ToUs of its primary source. Specifically, the use of OmniCorpus-CC, OmniCorpus-CW, and OmniCorpus-YT must comply with [the Common Crawl ToUs](https://commoncrawl.org/terms-of-use), the [regulations](https://www.gov.cn/zhengce/content/202409/content_6977766.htm) on the security management of Internet data in China, and [YouTube’s ToUs](https://www.youtube.com/terms), respectively.
- These ToUs do not supersede the ToUs of the original content sources.
  Users must ensure that any use of the dataset’s content complies with the original ToUs and the rights of the data subjects.

# Citation

```
@inproceedings{li2024omnicorpus,
  title={OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text},
  author={Li, Qingyun and Chen, Zhe and Wang, Weiyun and Wang, Wenhai and Ye, Shenglong and Jin, Zhenjiang and others},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}
```
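
The Data Format section describes each document as parallel `images` and `texts` lists, and the overview mentions that the format can be adapted to pure text corpora, image-text pairs, and interleaved data. The following is a minimal sketch of those three conversions, assuming that each position holds either an image URL or a text paragraph (as in the schema shown) and that pairing an image with the next text paragraph is an acceptable heuristic. The function names (`interleave`, `to_pure_text`, `to_image_text_pairs`) and the toy `example_doc` are hypothetical and not part of the official code.

```python
from typing import Any, Dict, List, Tuple

Doc = Dict[str, Any]


def interleave(doc: Doc) -> List[Tuple[str, str]]:
    """Return the document as an ordered list of ("image", url) / ("text", paragraph)."""
    items: List[Tuple[str, str]] = []
    for image_url, paragraph in zip(doc["images"], doc["texts"]):
        if image_url is not None:
            items.append(("image", image_url))
        elif paragraph is not None:
            items.append(("text", paragraph))
    return items


def to_pure_text(doc: Doc) -> str:
    """Drop the images and join the text paragraphs in reading order."""
    return "\n\n".join(t for t in doc["texts"] if t is not None)


def to_image_text_pairs(doc: Doc) -> List[Tuple[str, str]]:
    """Pair each image with the first text paragraph that follows it.

    This pairing rule is an assumption made for illustration only.
    """
    items = interleave(doc)
    pairs: List[Tuple[str, str]] = []
    for idx, (kind, value) in enumerate(items):
        if kind != "image":
            continue
        following = next((v for k, v in items[idx + 1:] if k == "text"), None)
        if following is not None:
            pairs.append((value, following))
    return pairs


# Toy document following the schema in the Data Format section (made-up values).
example_doc = {
    "images": ["https://example.com/a.jpg", None, "https://example.com/b.jpg", None],
    "texts": [None, "First paragraph.", None, "Second paragraph."],
}

sequence = interleave(example_doc)
plain_text = to_pure_text(example_doc)
pairs = to_image_text_pairs(example_doc)
```

With a real shard, `doc_dict = df.iloc[i].to_dict()` could be passed in place of `example_doc`. The list-like columns may arrive as NumPy arrays, which `zip` handles in the same way.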

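The README notes that the per-document and per-image metadata can enrich filtering strategies. The sketch below shows one way such a filter might look, using only the field names listed in the Data Format section. All thresholds (`min_fluency`, `min_non_ad`, `max_unsafe`, `min_side`, `max_dup_count`) and function names are hypothetical choices for illustration; they are not the values used in the OmniCorpus pipeline.

```python
from typing import Any, Dict, Optional

Doc = Dict[str, Any]


def keep_document(doc: Doc, min_fluency: float = 0.5,
                  min_non_ad: float = 0.5, max_unsafe: float = 0.2) -> bool:
    """Accept a document when its text-quality scores pass the (assumed) thresholds."""
    meta = doc["general_metadata"]
    if meta["fluency_prob"] < min_fluency:
        return False
    if meta["non_advertisement_prob"] < min_non_ad:
        return False
    return all(meta[key] <= max_unsafe
               for key in ("porn_prob", "politics_prob", "toxic_prob"))


def keep_image(image_meta: Optional[Dict[str, Any]],
               min_side: int = 64, max_dup_count: int = 10) -> bool:
    """Accept an image that is large enough and not heavily duplicated."""
    if image_meta is None:
        return False
    if min(image_meta["width"], image_meta["height"]) < min_side:
        return False
    dup = max(image_meta["d_hash_dup_count"], image_meta["p_hash_dup_count"])
    return dup <= max_dup_count


def filter_images(doc: Doc) -> Doc:
    """Return a copy of the document with rejected images blanked out (set to None)."""
    images, metadata = [], []
    for url, meta in zip(doc["images"], doc["metadata"]):
        ok = url is not None and keep_image(meta)
        images.append(url if ok else None)
        metadata.append(meta if ok else None)
    return {**doc, "images": images, "metadata": metadata}


# Toy document with made-up scores, following the documented schema.
example_doc = {
    "images": ["https://example.com/a.jpg", None, "https://example.com/icon.png", None],
    "texts": [None, "First paragraph.", None, "Second paragraph."],
    "metadata": [
        {"width": 640, "height": 480, "d_hash_dup_count": 1, "p_hash_dup_count": 2},
        None,
        {"width": 16, "height": 16, "d_hash_dup_count": 1, "p_hash_dup_count": 1},
        None,
    ],
    "general_metadata": {
        "fluency_prob": 0.9, "non_advertisement_prob": 0.8,
        "porn_prob": 0.01, "politics_prob": 0.05, "toxic_prob": 0.02,
    },
}

document_ok = keep_document(example_doc)
filtered = filter_images(example_doc)
```

Blanking a rejected image, rather than deleting its position, keeps the `images`, `texts`, and `metadata` lists aligned, so the remaining items stay in their original reading order.
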

Provider:

maas

Created:

2024-10-23

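Collected summary

Dataset introduction

Background and challenges

Background overview

OmniCorpus-CC-210M is a large-scale multimodal corpus of 210 million image-text interleaved documents sourced from Common Crawl. It belongs to OmniCorpus, which spans 8.6 billion images and 1,696 billion text tokens. The corpus offers rich data diversity, covering bilingual Chinese and English content as well as text-centric and vision-centric documents, and it uses a flexible streaming data format suited to tasks such as pre-training multimodal large language models and long text-image retrieval.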

The summary above was collected and generated by 遇见数据集.
