WanJuan
Indexed from arXiv on 2025-09-30
Download link:
https://github.com/opendatalab/wanjuan1.0
Description:
This dataset is provided by OpenDataLab, a platform hosting more than 6,500 open datasets that cover more than 30 data formats and support more than 50 task types. The collection includes large-scale pre-training datasets such as the WanJuan dataset, the image-text pair dataset Laion5B, and the video-centric multimodal dataset InternVid. OpenDataLab offers dataset support for every stage of large model development and uses a standardized Dataset Description Language (DSDL) to improve data interoperability and reusability. In total, the collection exceeds 80 TB and contains over 6 billion images, 800 million video clips, 100 billion tokens, 1 million 3D models, and 20,000 hours of audio. These datasets support a range of artificial intelligence tasks, including pre-training, fine-tuning, and evaluation.
Provider:
OpenDataLab



