WanJuan3.0 (万卷·丝路, "Wanjuan·Silk Road") Multilingual Multimodal Corpus
Source: OpenDataLab. Updated: 2026-06-07. Indexed: 2025-01-18.
Download link:
https://opendatalab.org.cn/OpenDataLab/WanJuan3
Resource overview:
WanJuan3.0 ("Wanjuan·Silk Road") is a comprehensive plain-text corpus that collects publicly available web content, literature, patents and other materials from multiple countries and regions. Its total size exceeds 1.2 TB and its total token count exceeds 300B (300 billion), a scale the publisher describes as internationally leading. The first open-source release consists mainly of five subsets, in Thai, Russian, Arabic, Korean and Vietnamese, each containing more than 150 GB of data.
Using the "书生·浦语" ("Shusheng·Puyu") intelligent label classification system, the research team at Shanghai AI Laboratory divided each language subset into 7 major categories and 32 subcategories. These cover content characteristic of the region where each language is spoken, including history, politics, culture, real estate, shopping, weather, dining, encyclopedic knowledge and professional knowledge. The labels let researchers retrieve data according to specific needs and support the varied requirements of different research fields.
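As an illustration of label-based retrieval, the following is a minimal Python sketch that selects records by language and category. It assumes the data is stored as JSON Lines and that each record carries `language`, `category` and `text` fields; these field names, the label values and the helper `filter_records` are hypothetical and are not taken from the dataset's documentation, so check the actual schema on the download page before adapting it.

```python
import json
from typing import Iterable, Iterator


def filter_records(lines: Iterable[str], language: str, category: str) -> Iterator[dict]:
    """Yield records whose (hypothetical) language and category fields match."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # "language" and "category" are assumed field names, not the real schema.
        if record.get("language") == language and record.get("category") == category:
            yield record


# Tiny in-memory sample standing in for a JSON Lines file from one subset.
sample_lines = [
    json.dumps({"language": "th", "category": "history", "text": "sample text 1"}),
    json.dumps({"language": "th", "category": "weather", "text": "sample text 2"}),
    json.dumps({"language": "ru", "category": "history", "text": "sample text 3"}),
    "",
]

thai_history = list(filter_records(sample_lines, language="th", category="history"))
print(len(thai_history))
```

For a real subset, the same function can be given an open file handle instead of the in-memory list, which streams records line by line rather than loading a file of 150 GB or more into memory.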
Provider:
OpenDataLab
Created:
2025-01-10
Collected summary
Dataset introduction

Background and challenges
Background overview
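WanJuan3.0 (万卷-丝路) is a multilingual, multimodal corpus released by Shanghai AI Laboratory, intended to support Belt and Road development through artificial intelligence. The dataset is open-sourced in two phases. The first phase provides more than 1.2 TB of plain-text pretraining data covering 5 languages, including Thai and Russian. The second phase extends the corpus to multimodal data spanning 4 modalities, including image-text and audio, and covers 8 languages, for a total of 11.5 million high-quality records that are described as meeting industrial-grade standards.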
The content above was collected and summarized by 遇见数据集.



