OmniCorpus-CC
ModelScope community, updated 2025-11-12, indexed 2024-10-26
Download link:
https://modelscope.cn/datasets/OpenGVLab/OmniCorpus-CC
Description:
<p align="center">
<h1 align="center">🐳 OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text</h1>
</p>
> ⭐️ **NOTE:** Several parquet files were marked unsafe (viruses) by the official Hugging Face scanner, although ClamAV and VirusTotal report them as safe.
> We found [many false positive cases](https://discuss.huggingface.co/u/mcpotato/summary) of the automatic scanning in the Hugging Face discussions and raised [one discussion](https://discuss.huggingface.co/t/one-parquet-file-of-my-dataset-was-marked-unsafe/113745) to ask for a re-scan.
This is the repository of OmniCorpus-CC, which contains 988 million image-text interleaved documents collected from [Common Crawl](https://commoncrawl.org/).
- Repository: https://github.com/OpenGVLab/OmniCorpus
- Paper (ICLR 2025 Spotlight): https://arxiv.org/abs/2406.08418
The OmniCorpus dataset is a large-scale image-text interleaved dataset that pushes the boundaries of scale and diversity by encompassing **8.6 billion images** interleaved with **1,696 billion text tokens** from diverse sources, significantly surpassing previous datasets.
This dataset demonstrates several advantages over its counterparts:
1. **Larger data scale:** Our dataset is 1.7 times larger in images and 12.5 times larger in texts compared to the previously largest multimodal dataset, LAION-5B, while maintaining excellent data quality.
2. **Richer data diversity:** Drawing from a broader range of data sources, our dataset is more diverse than other image-text interleaved datasets. It includes bilingual multimodal data in both Chinese and English, and encompasses text-centric and vision-centric documents extracted from common websites and video platforms.
3. **More flexible format:** The streaming data format of our dataset offers exceptional flexibility, allowing adaptation to various data structures, including pure text corpora, image-text pairs, and interleaved data formats.
<img width="578" alt="image" src="https://github.com/OpenGVLab/OmniCorpus/assets/47669167/641a6427-ba50-41e6-8634-8810113fd803">
The OmniCorpus contains three sections:
- **OmniCorpus-CC**: processed from dumps in Common Crawl from 2013 to Nov./Dec. 2023.
- **OmniCorpus-CW**: sourced from Chinese internet resources; it will be available on the [OpenDataLab](https://opendatalab.com/) platform.
- **OmniCorpus-YT**: samples YouTube video frames as images and collects subtitles as texts.
Code for pre-training, evaluation, main body extraction, and filtering has been released in the official [repository](https://github.com/OpenGVLab/OmniCorpus). A pre-trained model is available [here](https://huggingface.co/Qingyun/OmniCorpus-InternVL).
# Data Pipeline
Our data pipeline consists of five key stages: main body extraction, preliminary text filtering, document deduplication, image downloading and filtering, and detailed text filtering. Each stage efficiently reduces the dataset to retain only high-quality data.
Please refer to our paper for more details about the data pipeline.
<img width="723" alt="image" src="https://github.com/OpenGVLab/OmniCorpus/assets/47669167/a6de8928-58fb-4ff4-8ef9-4bd90e9ada5f">
# Usages
The image-text interleaved documents are recommended for the following uses:
- Pre-training multimodal large language models (MLLMs): Recent MLLMs (such as the Flamingo series, EMU series, IDEFICS series, MM1, Cambrian-1, and xGen-MM) have shown that image-text interleaved data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning.
- Long text-image retrieval: We provide image-text similarities calculated with CLIP, which make it possible to convert the documents into an image-text retrieval dataset with longer text. A retrieval model pre-trained on such data can retrieve images based on longer text, which is useful for multimodal RAG, converting pure text into multimodal samples, etc.
- Source for further dataset research: Our data is large-scale and can serve as a source for research on data curation strategies. We provide many useful attributes as metadata for each document, which can enrich filtering strategies and reduce their cost.
- ......
# Data Format
Following common practices, the data is organized into Parquet file format.
You might encounter errors when using `pandas.read_parquet` (because the data structure contains nested elements). We recommend using fastparquet to load the parquet files.
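```Python
import fastparquet
import pyarrow.parquet as pq

# parquet_file_path: local path to one downloaded parquet file
# Load the whole file into a pandas DataFrame with fastparquet
df = fastparquet.ParquetFile(parquet_file_path).to_pandas()

# You can also stream the file batch by batch with pyarrow's iter_batches
parquet_file = pq.ParquetFile(parquet_file_path)
for batch in parquet_file.iter_batches():
    df = batch.to_pandas()
```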
You can select the i-th document and convert it into a dictionary.
```Python
doc_dict = df.iloc[i].to_dict()
```
The document format is as follows:
```json
{
    'images': [
        <str: image_1_url>,
        None,
        <str: image_2_url>,
        None,
    ],
    'texts': [
        None,
        <str: text_paragraph_1_content>,
        None,
        <str: text_paragraph_2_content>,
    ],
    'metadata': [
        <dict: image_1_metadata>,
        None,
        <dict: image_2_metadata>,
        None
    ],
    'general_metadata': {
        "url": <str: document url>,
        "id": <str: document id>,
        "domain": <list[str]: domains extracted from document url>,
        "fluency_prob": <float: the probability of fluency>,
        "non_advertisement_prob": <float: the probability of non-advertisement>,
        "porn_prob": <float: the probability of porn content>,
        "politics_prob": <float: the probability of politics content>,
        "toxic_prob": <float: the probability of toxic content>,
    }
}
```
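The layout above can be turned into the other data structures mentioned earlier (interleaved sequences, pure text, image-text pairs). The following is a minimal sketch rather than the official conversion code. It assumes that `images` and `texts` are aligned position by position and that exactly one of the two is non-`None` at each position, as the example layout suggests. The helper names, URLs, and text are made up for illustration, and pairing each image with the next paragraph is only a naive heuristic.

```python
from typing import Any, Dict, List, Tuple

# Synthetic document in the layout shown above (URLs and text are made up).
example_doc: Dict[str, Any] = {
    "images": ["https://example.com/a.jpg", None, "https://example.com/b.jpg", None],
    "texts": [None, "First paragraph.", None, "Second paragraph."],
}


def to_sequence(doc: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return the document as ordered ("image", url) / ("text", paragraph) items."""
    sequence = []
    for image_url, text in zip(doc["images"], doc["texts"]):
        if image_url is not None:
            sequence.append(("image", image_url))
        elif text is not None:
            sequence.append(("text", text))
    return sequence


def to_plain_text(doc: Dict[str, Any]) -> str:
    """Drop the images and join the text paragraphs (pure text corpus view)."""
    return "\n".join(t for t in doc["texts"] if t is not None)


def to_image_text_pairs(doc: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Pair each image with the first paragraph that follows it (naive heuristic)."""
    sequence = to_sequence(doc)
    pairs = []
    for idx, (kind, value) in enumerate(sequence):
        if kind != "image":
            continue
        following = next((v for k, v in sequence[idx + 1:] if k == "text"), None)
        if following is not None:
            pairs.append((value, following))
    return pairs


if __name__ == "__main__":
    print(to_sequence(example_doc))
    print(to_plain_text(example_doc))
    print(to_image_text_pairs(example_doc))
```

When the document comes from `df.iloc[i].to_dict()`, the list fields may be array-like objects rather than Python lists; the `zip`-based loops above should still apply, but this has not been checked against the released files.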
The metadata of each image is as follows:
```json
{
    "img_url_sha": <str: sha code of image url>,
    "width": <int: image width>,
    "height": <int: image height>,
    "bytes": <int: byte number of the image file>,
    "d_hash": <str: d_hash code of the image, used for image deduplication>,
    "p_hash": <str: p_hash code of the image, used for image deduplication>,
    "d_hash_dup_count": <int: duplicated times detected by d_hash code>,
    "p_hash_dup_count": <int: duplicated times detected by p_hash code>,
    "aesthetic prob": <float: aesthetic probability>,
    "unsafe prob": <float: NSFW probability>,
}
```
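Because every image carries its own metadata, a custom filtering pass can run without downloading any image. The sketch below is one possible illustration, not the filtering used to build the dataset. The thresholds `MIN_SIDE` and `MAX_UNSAFE_PROB` and the helper names are hypothetical, the sample values are made up, and the key `"unsafe prob"` is copied from the listing above, so check it against the actual files before relying on it.

```python
from typing import Any, Dict, Optional

# Hypothetical thresholds, chosen only for illustration.
MIN_SIDE = 150          # minimum width/height in pixels
MAX_UNSAFE_PROB = 0.5   # maximum tolerated NSFW probability


def keep_image(meta: Optional[Dict[str, Any]]) -> bool:
    """Decide from metadata alone whether an image passes the filter."""
    if meta is None:
        return False
    large_enough = min(meta["width"], meta["height"]) >= MIN_SIDE
    safe = meta["unsafe prob"] <= MAX_UNSAFE_PROB  # key name as listed above
    return large_enough and safe


def filter_images(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the document with rejected images and their metadata set to None."""
    images, metadata = [], []
    for url, meta in zip(doc["images"], doc["metadata"]):
        if url is not None and not keep_image(meta):
            url, meta = None, None
        images.append(url)
        metadata.append(meta)
    return {**doc, "images": images, "metadata": metadata}


# Synthetic document (all values are made up).
sample_doc = {
    "images": ["https://example.com/a.jpg", None, "https://example.com/icon.png", None],
    "texts": [None, "First paragraph.", None, "Second paragraph."],
    "metadata": [
        {"width": 640, "height": 480, "unsafe prob": 0.01},
        None,
        {"width": 32, "height": 32, "unsafe prob": 0.02},
        None,
    ],
}

filtered_doc = filter_images(sample_doc)
```

Setting rejected entries to `None` instead of deleting them keeps the three lists aligned, so the position of every remaining image relative to the text is preserved.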
# License and Terms of Use
The OmniCorpus dataset is distributed under [the CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/). The open-source code is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
The Terms of Use (ToUs) have been developed based on widely accepted standards. By accessing or using this dataset, users acknowledge their responsibility to comply with all relevant legal, regulatory, and ethical standards.
- All users, whether from academia or industry, must comply with the ToUs outlined in the CC BY 4.0 License.
- Any derived datasets or models must acknowledge the use of the OmniCorpus dataset to maintain transparency.
- The OmniCorpus must not be used in any project involving sensitive content or harmful outcomes, including but not limited to political manipulation, hate speech generation, misinformation propagation, or tasks that perpetuate harmful stereotypes or biases.
- The use of this dataset in any manner that violates rights, such as copyright infringement, privacy breaches, or misuse of sensitive information, is strictly prohibited.
- While we do not enforce jurisdiction-specific terms, we strongly recommend that users ensure compliance with applicable local laws and regulations.
- The use of a specific subset must comply with the ToUs of its primary source. Specifically, the use of OmniCorpus-CC, OmniCorpus-CW, and OmniCorpus-YT must comply with [the Common Crawl ToUs](https://commoncrawl.org/terms-of-use), the [regulations](https://www.gov.cn/zhengce/content/202409/content_6977766.htm) on the security management of Internet data in China, and [YouTube’s ToUs](https://www.youtube.com/terms), respectively.
- These ToUs do not supersede the ToUs of the original content sources. Users must ensure that any use of the dataset’s content complies with the original ToUs and the rights of the data subjects.
# Citation
```
@inproceedings{li2024omnicorpus,
title={OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text},
author={Li, Qingyun and Chen, Zhe and Wang, Weiyun and Wang, Wenhai and Ye, Shenglong and Jin, Zhenjiang and others},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025}
}
```
Provider:
maas
Created:
2024-10-23



