WuDaoMM

Name: WuDaoMM
Creator: OpenDataLab
Published: 2026-05-24 08:30:36
License: No description available

OpenDataLab · Updated 2026-05-24 · Indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/WuDaoMM

Resource description:

WuDaoMM is part of the open-source WuDao Corpora dataset released by the Beijing Academy of Artificial Intelligence (BAAI). Last year, we open-sourced the world's largest Chinese text dataset, which contains 5 TB of pre-training text data. This year's release, WuDaoMM, is a multimodal pre-training dataset of image-text pairs. The full dataset contains 650 million image-text pairs and provides data support for large-scale Chinese multimodal pre-trained models such as WenLan and CogView. It comprises tens of millions of strongly correlated pairs and 600 million weakly correlated pairs. To make the data easier for researchers to download and use, the base version, WuDaoMM-base, is openly available. It consists of strongly correlated data sampled in a balanced way across categories. Researchers who need the full dataset can email us at data@baai.ac.cn. WuDaoMM-base covers 19 major categories, including energy, expressions, industry, healthcare, landscapes, animals, news, flowers, education, art, people, science, ocean, trees, automobiles, society, technology, and sports, among others. Each category contains roughly 70,000 to 400,000 samples.

Provided by:

OpenDataLab

Created:

2023-03-22

Collected summary

Dataset introduction

Background and challenges

Background overview

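WuDaoMM is a Chinese multimodal pre-training dataset released in 2022 by the Beijing Academy of Artificial Intelligence (BAAI) and Tsinghua University, and is part of WuDao Corpora. It consists of image-text pairs. The full version contains 650 million pairs, while the openly available base version, WuDaoMM-base, is composed of strongly correlated data covering 19 categories such as energy, healthcare, and technology, with roughly 70,000 to 400,000 samples per category. It is intended to support the training of large-scale Chinese multimodal models such as WenLan and CogView.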

The content above was collected and summarized by 遇见数据集.
