OmniCorpus-YT
Source: ModelScope community (updated 2025-12-26, indexed 2024-10-26)
Download link: https://modelscope.cn/datasets/OpenGVLab/OmniCorpus-YT
<p align="center">
<h1 align="center">🐳 OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text</h1>
</p>
This is the repository of OmniCorpus-YT, which contains 10 million image-text interleaved documents collected from YouTube videos.
- Repository: https://github.com/OpenGVLab/OmniCorpus
- Paper (ICLR 2025 Spotlight): https://arxiv.org/abs/2406.08418
The OmniCorpus dataset is a large-scale image-text interleaved dataset that pushes the boundaries of scale and diversity by encompassing **8.6 billion images** interleaved with **1,696 billion text tokens** from diverse sources, significantly surpassing previous datasets.
This dataset demonstrates several advantages over its counterparts:
1. **Larger data scale:** Our dataset is 1.7 times larger in images and 12.5 times larger in texts compared to the previously largest multimodal dataset, LAION-5B, while maintaining excellent data quality.
2. **Richer data diversity:** Drawing from a broader range of data sources, our dataset is more diverse than other image-text interleaved datasets. It includes bilingual multimodal data in both Chinese and English, and encompasses text-centric and vision-centric documents extracted from common websites and video platforms.
3. **More flexible format:** The streaming data format of our dataset offers exceptional flexibility, allowing adaptation to various data structures, including pure text corpora, image-text pairs, and interleaved data formats.
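As a rough illustration of the format flexibility described in the third point, the sketch below derives a pure-text corpus entry and image-text pairs from one interleaved document. The tagged-tuple representation, the function names, and the rule of pairing each image with the next text item are all assumptions made for this example; they are not part of the released tooling.

```python
def to_pure_text(sequence):
    """Keep only the text items of an interleaved sequence, in order."""
    return "\n".join(content for kind, content in sequence if kind == "text")


def to_image_text_pairs(sequence):
    """Pair each image with the first text item that follows it (assumed heuristic)."""
    pairs = []
    pending_image = None
    for kind, content in sequence:
        if kind == "image":
            pending_image = content
        elif kind == "text" and pending_image is not None:
            pairs.append((pending_image, content))
            pending_image = None
    return pairs


# Hypothetical interleaved document: ("image", reference) and ("text", paragraph) items.
example_sequence = [
    ("image", "img_0"),
    ("text", "A caption-like paragraph."),
    ("text", "A paragraph with no image."),
    ("image", "img_1"),
    ("text", "Another paragraph."),
]

pure_text = to_pure_text(example_sequence)
pairs = to_image_text_pairs(example_sequence)
print(pure_text)
print(pairs)
```

Text items that do not follow an image are dropped from the pairs but kept in the pure-text view, so the same document can feed either kind of pipeline.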
<img width="578" alt="image" src="https://github.com/OpenGVLab/OmniCorpus/assets/47669167/641a6427-ba50-41e6-8634-8810113fd803">
The OmniCorpus contains three sections:
- **OmniCorpus-CC**: processed from dumps in Common Crawl from 2013 to Nov./Dec. 2023.
- **OmniCorpus-CW**: sourced from Chinese internet resources; it will be available on the [OpenDataLab](https://opendatalab.com/) platform.
- **OmniCorpus-YT**: samples YouTube video frames as images and collects subtitles as texts.
Code for pre-training, evaluation, main body extraction, and filtering has been released in the official [repository](https://github.com/OpenGVLab/OmniCorpus). A pre-trained model is available [here](https://huggingface.co/Qingyun/OmniCorpus-InternVL).
# Usages
The image-text interleaved documents are recommended for the following uses:
- Pre-training multimodal large language models (MLLMs): Recent MLLMs (such as the Flamingo series, EMU series, IDEFICS series, MM1, Cambrian-1, and xGen-MM) have shown that image-text interleaved data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning.
- Long text-image retrieval: We provide image-text similarities calculated with CLIP, which can be used to convert the documents into an image-text retrieval dataset with longer texts. A retrieval model pre-trained on such data can retrieve images based on longer text, which is useful for multimodal RAG, converting pure text into multimodal samples, etc.
- Source for further dataset research: Our data is large-scale and can serve as a source for research on data curation strategies. We provide many useful attributes as metadata for each document, which can enrich filtering strategies and reduce their cost.
- ......
# Data Format
Following common practices, the data is organized into Parquet file format.
You might encounter errors when using `pandas.read_parquet` (because the data structure contains nested elements). We recommend using fastparquet to load the parquet files.
```Python
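import fastparquet
import pyarrow.parquet as pq

# Load the whole file into a DataFrame with fastparquet.
df = fastparquet.ParquetFile(parquet_file_path).to_pandas()

# Alternatively, stream the file batch by batch with pyarrow's iter_batches.
parquet_file = pq.ParquetFile(parquet_file_path)
for batch in parquet_file.iter_batches():
    df = batch.to_pandas()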
```
You can select the i-th document and convert it into a dictionary.
```Python
doc_dict = df.iloc[i].to_dict()
```
The document format is as follows:
```json
{
'id': <str: youtube video id>,
'images': <bytes: list of image timestamps>,
'texts': <bytes: list of texts>
}
```
The images and texts can be loaded with `lambda s: json.loads(s)`:
```json
'images': [
<str: key_frame_1_timestamp>,
None,
<str: key_frame_2_timestamp>,
None,
],
'texts': [
None,
<str: text_paragraph_1_content>,
None,
<str: text_paragraph_2_content>,
]
```
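The decoding step can be sketched as follows. This is a minimal example that assumes the two fields are JSON-encoded parallel lists of equal length in which `None` marks the slot filled by the other modality, as the layout above suggests. The helper names and the example document are hypothetical.

```python
import json


def decode_document(doc_dict):
    """Decode the JSON-encoded `images` and `texts` fields of one document."""
    images = json.loads(doc_dict["images"])
    texts = json.loads(doc_dict["texts"])
    return images, texts


def interleave(images, texts):
    """Merge the two parallel lists into one ordered sequence of tagged items."""
    sequence = []
    for image, text in zip(images, texts):
        if image is not None:
            sequence.append(("image", image))
        if text is not None:
            sequence.append(("text", text))
    return sequence


# Hypothetical document shaped like the format described above.
example_doc = {
    "id": "example_video_id",
    "images": json.dumps(["19.000000", None, "23.000000", None]).encode("utf-8"),
    "texts": json.dumps([None, "first paragraph", None, "second paragraph"]).encode("utf-8"),
}

images, texts = decode_document(example_doc)
sequence = interleave(images, texts)
print(sequence)
```

`json.loads` accepts `bytes` directly, so the fields do not need to be decoded to `str` first.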
The frames can be sampled from downloaded YouTube videos. We provide a Python sampling tool:
```python
import os
import traceback
from multiprocessing import Pool

import yt_dlp  # pip install yt-dlp
import ffmpeg  # brew install ffmpeg; pip install ffmpeg-python


def download_hls_url(youtube_id):
    # Resolve the direct stream URL without downloading the video.
    video_url = f"https://www.youtube.com/watch?v={youtube_id}"
    ydl_opts = {
        'format': 'best',
        'noplaylist': True,
        'quiet': True,
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(video_url, download=False)
        return info['url']


def extract_frame(hls_url, timestamp, output_file):
    # Seek to the timestamp (in seconds) and save a single frame.
    try:
        (
            ffmpeg
            .input(hls_url, ss=timestamp, protocol_whitelist='file,http,https,tcp,tls,httpproxy')
            .output(output_file, vframes=1)
            .run(quiet=True, capture_stdout=True, capture_stderr=True)
        )
    except ffmpeg.Error as e:
        print(f"Error extracting frame at timestamp {timestamp}: {e}")
        print("FFmpeg stderr output:\n", e.stderr.decode())
        traceback.print_exc()


def extract_frames_with_hls(youtube_id, timestamps, output_dir='frames'):
    os.makedirs(output_dir, exist_ok=True)
    hls_url = download_hls_url(youtube_id)
    # One task per timestamp; each frame is written to <output_dir>/<timestamp>.jpg.
    tasks = [(hls_url, timestamp, os.path.join(output_dir, f"{timestamp}.jpg")) for timestamp in timestamps]
    with Pool() as pool:
        pool.starmap(extract_frame, tasks)


if __name__ == "__main__":
    extract_frames_with_hls("1xGiPUeevCM", [19.000000, 23.000000, 28.000000, 32.000000, 45.000000, 54.000000, 57.000000, 67.000000])
```
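To connect a document to the sampling tool, the timestamps first have to be pulled out of its `images` field. The sketch below is one way to do that. It assumes the timestamp strings are seconds that `float` can parse, which the numeric example call above suggests but the dataset card does not state. The helper names are hypothetical.

```python
import json
import os


def frame_timestamps(images_field):
    """Return a document's key-frame timestamps as floats, skipping None placeholders."""
    entries = json.loads(images_field)
    return [float(entry) for entry in entries if entry is not None]


def frame_paths(timestamps, output_dir="frames"):
    """Build the <output_dir>/<timestamp>.jpg paths that the sampling tool writes to."""
    return [os.path.join(output_dir, f"{timestamp}.jpg") for timestamp in timestamps]


# Hypothetical `images` field with two key frames.
example_images = json.dumps(["19.000000", None, "23.000000", None])

timestamps = frame_timestamps(example_images)
paths = frame_paths(timestamps)
print(timestamps)
print(paths)
```

The resulting list can be passed as the `timestamps` argument of `extract_frames_with_hls`, together with the document's `id`.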
# License and Terms of Use
The OmniCorpus dataset is distributed under [the CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/). The open-source code is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
The Terms of Use (ToUs) have been developed based on widely accepted standards. By accessing or using this dataset, users acknowledge their responsibility to comply with all relevant legal, regulatory, and ethical standards.
- All users, whether from academia or industry, must comply with the ToUs outlined in the CC BY 4.0 License.
- Any derived datasets or models must acknowledge the use of the OmniCorpus dataset to maintain transparency.
- The OmniCorpus must not be used in any project involving sensitive content or harmful outcomes, including but not limited to political manipulation, hate speech generation, misinformation propagation, or tasks that perpetuate harmful stereotypes or biases.
- The use of this dataset in any manner that violates rights, such as copyright infringement, privacy breaches, or misuse of sensitive information, is strictly prohibited.
- While we do not enforce jurisdiction-specific terms, we strongly recommend that users ensure compliance with applicable local laws and regulations.
- The use of a specific subset must comply with the ToUs of its primary source. Specifically, the use of OmniCorpus-CC, OmniCorpus-CW, and OmniCorpus-YT must comply with [the Common Crawl ToUs](https://commoncrawl.org/terms-of-use), the [regulations](https://www.gov.cn/zhengce/content/202409/content_6977766.htm) on the security management of Internet data in China, and [YouTube’s ToUs](https://www.youtube.com/terms), respectively.
- These ToUs do not supersede the ToUs of the original content sources. Users must ensure that any use of the dataset’s content complies with the original ToUs and the rights of the data subjects.
# Citation
```
@inproceedings{li2024omnicorpus,
title={OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text},
author={Li, Qingyun and Chen, Zhe and Wang, Weiyun and Wang, Wenhai and Ye, Shenglong and Jin, Zhenjiang and others},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025}
}
```
Provider: maas
Created: 2024-10-23



