Resource overview:
OmniCorpus, built jointly by the Shanghai Artificial Intelligence Laboratory and several universities and research institutions, is presented as the largest multimodal dataset to date. It contains 8.6 billion images and 1,696 billion text tokens in Chinese and English. Compared with existing datasets, it offers three main advantages:
1) Larger scale: compared with LAION-5B, previously the largest multimodal dataset, OmniCorpus is 1.7 times larger in images and 12.5 times larger in text, while maintaining data quality.
2) Greater diversity: because it draws on a broader range of sources, OmniCorpus is more diverse than other image-text interleaved datasets. It includes bilingual Chinese and English multimodal data, with both text-centric and vision-centric documents extracted from common websites and video platforms.
3) More flexible format: the streaming data format can be adapted to several data structures, including plain-text corpora, image-text pairs, and interleaved image-text data.
The construction pipeline has five stages: main body extraction, preliminary text filtering, document deduplication, image downloading and filtering, and detailed text filtering. Each stage shrinks the dataset so that only high-quality data remains. The bilingual, high-quality data gives multimodal machine learning models a rich training resource and supports research in artificial intelligence.
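The format flexibility described above can be illustrated with a minimal sketch. It assumes a hypothetical interleaved record made of two parallel lists, `texts` and `images`, where each position holds either a text segment or an image reference. The field names, the pairing rule, and the sample values are illustrative assumptions, not the dataset's documented schema.

```python
from typing import Optional, TypedDict


class InterleavedDoc(TypedDict):
    # Hypothetical record layout: two parallel lists of equal length.
    # At each position exactly one of texts[i] / images[i] is set.
    texts: list[Optional[str]]
    images: list[Optional[str]]  # image URLs or file paths


def to_plain_text(doc: InterleavedDoc) -> str:
    """Drop the images and join the remaining text segments."""
    return "\n".join(t for t in doc["texts"] if t is not None)


def to_image_text_pairs(doc: InterleavedDoc) -> list[tuple[str, str]]:
    """Pair each image with the first text segment that follows it.

    Images with no following text are dropped. This pairing rule is an
    assumption chosen for illustration only.
    """
    pairs: list[tuple[str, str]] = []
    pending: list[str] = []  # images still waiting for a text segment
    for text, image in zip(doc["texts"], doc["images"]):
        if image is not None:
            pending.append(image)
        elif text is not None:
            pairs.extend((img, text) for img in pending)
            pending.clear()
    return pairs


# Made-up sample document in the assumed layout.
doc: InterleavedDoc = {
    "texts": ["Intro paragraph.", None, "A cat on a sofa.", None],
    "images": [None, "cat.jpg", None, "dog.jpg"],
}

plain = to_plain_text(doc)
pairs = to_image_text_pairs(doc)
```

Under these assumptions, one interleaved record can feed a text-only corpus through `to_plain_text` or an image-text pair set through `to_image_text_pairs`, while the original record remains available for interleaved training.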
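Provided by:
Shanghai Artificial Intelligence Laboratory, Harbin Institute of Technology, Nanjing University, Fudan University, and others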