WuDaoCorpora-Text

Name: WuDaoCorpora-Text
Creator: OpenDataLab
Published: 2026-05-17 10:30:55
License: No description available

OpenDataLab, updated 2026-05-17, indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/WuDaoCorpora-Text

Resource overview:

WuDaoCorpora is a large-scale, high-quality dataset built by the Beijing Academy of Artificial Intelligence (BAAI) to support research on large model training. It currently comprises four components: text, dialogue, image-text pairs, and video-text pairs. These components aim, respectively, to construct a miniature linguistic world, distill the core regularities of dialogue, break down the barrier between the image and text modalities, and establish correlations between video and text, providing solid data support for large model training.

Provider:

OpenDataLab

Created:

2024-04-30

Collection summary

Dataset introduction