Wanjuan·Silk Road 2.0 (5 languages: Arabic, Russian, Korean, Vietnamese, Thai)
Source: ModelScope community. Updated 2026-04-21; indexed 2025-06-14.
Download link:
https://modelscope.cn/datasets/OpenDataLab/WanJuanSiLu2O
Resource description:
# Wanjuan·Silk Road 2.0 Multimodal Multilingual Corpus
## Dataset Introduction
Our newly upgraded "Wanjuan·Silk Road 2.0" brings three core improvements:
- **Significantly Expanded Language Coverage**: Building on the 5 languages open-sourced in "Wanjuan·Silk Road 1.0" (Arabic, Russian, Korean, Vietnamese, Thai), "Wanjuan·Silk Road 2.0" adds corpora for 3 scarce, low-resource languages: Serbian, Hungarian, and Czech. Together these 8 key languages support global multilingual applications.
- **Comprehensive Upgrade of Data Modalities**: Unlike the text-only "Wanjuan·Silk Road 1.0", "Wanjuan·Silk Road 2.0" provides four types of data for all 8 languages: image-text, audio-text, video-text, and specialized instruction fine-tuning (SFT) data, covering the full multimodal research pipeline. The corpus contains more than 11.5 million entries and more than 26,000 hours of audio and video, which supports a wide range of research tasks.
- **Ultra-fine Data for Multi-scenario Applications**: The data passes through a mature production pipeline and security hardening, followed by fine-grained annotation and quality inspection by both machines and local experts, so that "Wanjuan·Silk Road 2.0" reaches industrial-grade data quality. It carries more than 20 fine-grained, multi-dimensional classification tags and detailed text descriptions, suits scenarios such as cultural tourism, commercial trade, and technology and education, and is ready to use out of the box, so developers can focus on value creation.
## Open-Sourced Content
- Over 2 million image-text pairs
- Over 1,600 hours of audio-text data
- Over 25,000 hours of video-text data
- 180,000 SFT samples
Detailed open-sourced data:
| Language | Image-Text Module (Images) | Audio Module (Hours) | Video Module (Hours) | SFT Module (Samples) |
|---|---|---|---|---|
| Arabic | 220,000 | 200 | 1,738 | 23,000 |
| Russian | 250,000 | 212 | 3,491 | 23,000 |
| Korean | 530,000 | 202 | 3,412 | 23,000 |
| Vietnamese | 450,000 | 205 | 2,901 | 23,000 |
| Thai | 100,000 | 201 | 5,684 | 23,000 |
| Serbian | 80,000 | 206 | 2,578 | 23,000 |
| Hungarian | 220,000 | 208 | 3,470 | 23,000 |
| Czech | 270,000 | 202 | 2,453 | 23,000 |
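The headline figures can be cross-checked against the per-language rows. The short Python sketch below copies the numbers from the table above and sums each column; the dictionary layout and the function name `column_totals` are illustrative choices, not part of the dataset.

```python
# Per-language figures copied from the table above:
# (image-text samples, audio hours, video hours, SFT samples)
TABLE = {
    "Arabic": (220_000, 200, 1_738, 23_000),
    "Russian": (250_000, 212, 3_491, 23_000),
    "Korean": (530_000, 202, 3_412, 23_000),
    "Vietnamese": (450_000, 205, 2_901, 23_000),
    "Thai": (100_000, 201, 5_684, 23_000),
    "Serbian": (80_000, 206, 2_578, 23_000),
    "Hungarian": (220_000, 208, 3_470, 23_000),
    "Czech": (270_000, 202, 2_453, 23_000),
}


def column_totals(table):
    """Sum each column of the table across all languages."""
    images, audio, video, sft = (sum(col) for col in zip(*table.values()))
    return {
        "image_text": images,
        "audio_hours": audio,
        "video_hours": video,
        "sft": sft,
    }


totals = column_totals(TABLE)
print(totals)
```

The sums come to 2,120,000 image-text samples, 1,636 audio hours, 25,727 video hours, and 184,000 SFT samples. The 25,000-hour figure therefore corresponds to the video column, and audio plus video together exceed 26,000 hours.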
**⚠⚠⚠ [Note]**
- This repository mainly provides resources for 5 languages: Arabic, Russian, Korean, Vietnamese, and Thai. After logging in you can download and use them directly (no application required).
- For the other 3 languages (Serbian, Hungarian, Czech), please visit [https://opendatalab.com/OpenDataLab/WanJuanSiLu2](https://opendatalab.com/OpenDataLab/WanJuanSiLu2), click to apply, and download once the author approves.
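For scripts that need to decide where a given language's data comes from, the access rules in the note above can be encoded as a small lookup. This is only a sketch: the function name `access_info` and the returned dictionary shape are hypothetical, and the two URLs are the ones quoted on this page.

```python
# URLs as quoted on this page; everything else here is an illustrative assumption.
MODELSCOPE_URL = "https://modelscope.cn/datasets/OpenDataLab/WanJuanSiLu2O"
OPENDATALAB_URL = "https://opendatalab.com/OpenDataLab/WanJuanSiLu2"

DIRECT_DOWNLOAD = {"Arabic", "Russian", "Korean", "Vietnamese", "Thai"}
BY_APPLICATION = {"Serbian", "Hungarian", "Czech"}


def access_info(language: str) -> dict:
    """Return the page to visit and whether an application is required."""
    name = language.strip().capitalize()
    if name in DIRECT_DOWNLOAD:
        return {"url": MODELSCOPE_URL, "needs_application": False}
    if name in BY_APPLICATION:
        return {"url": OPENDATALAB_URL, "needs_application": True}
    raise ValueError(f"{language!r} is not one of the 8 released languages")


print(access_info("Thai"))
print(access_info("czech"))
```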
## Data Processing Characteristics
#### Image-Text Data
- Balanced Coverage across Multiple Domains: High-quality image-text data sourced from Wikipedia, Wikiquote, encyclopedias, and mainstream news media in the countries where the 8 languages are spoken.
- Innovative Dual Annotation: A basic alt-text description plus an extended description generated by a vision model, which makes each annotation more informative.
- Evenly distributed across 10 high-interest domains to avoid data skew. Tags include: outdoor scenes, indoor scenes, urban scenes, rural scenes, text & technology, natural landscapes, folk traditions, adults, food.
#### Audio-Text Data
- Ultra-high quality ensured by dual ASR verification: The audio-text data is transcribed from mainstream video media platforms and cross-validated with two commercial Automatic Speech Recognition (ASR) engines, Google and Microsoft, to ensure accurate transcripts. Environmental noise reduction is also applied to improve audio quality.
- Real-world scenario speech: Contains natural dialogue recorded with environmental noise, close to actual application conditions. Compared with similar datasets, this dataset has clear advantages in multilingual coverage, dialogue authenticity, and annotation quality.
- 4 major data categories: social humanities, entertainment media, academic education, life and culture.
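The dataset card does not say how the two ASR outputs are compared. One common way to cross-validate two transcripts is to measure their word-level edit distance and keep a clip only when the disagreement is small. The sketch below illustrates that idea under stated assumptions: the function names, the whitespace tokenization, and the 0.1 threshold are all hypothetical and are not taken from the dataset's pipeline.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def disagreement_rate(transcript_a: str, transcript_b: str) -> float:
    """Edit distance normalized by the longer transcript's word count."""
    a = transcript_a.lower().split()
    b = transcript_b.lower().split()
    if not a and not b:
        return 0.0
    return word_edit_distance(a, b) / max(len(a), len(b))


def transcripts_agree(transcript_a: str, transcript_b: str,
                      max_disagreement: float = 0.1) -> bool:
    """Assumed rule: keep a clip only if the two engines mostly agree."""
    return disagreement_rate(transcript_a, transcript_b) <= max_disagreement


rate = disagreement_rate("the cat sat on the mat", "the cat sat on a mat")
print(round(rate, 3), transcripts_agree("the cat sat on the mat",
                                        "the cat sat on a mat"))
```

Whitespace splitting is a simplification. Thai, for example, does not separate words with spaces, so a real pipeline would need a language-specific tokenizer or a character-level comparison.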
#### Video-Text Data
- Rich language categories, filling data gaps: Video in 8 languages (including Hungarian, Serbian, and others) totals more than 16,000 hours. Unlike similar datasets, it includes many low-resource languages and fills the gap for these languages in video datasets, making it a valuable resource for multimodal research and low-resource language processing.
- Multimodal annotation system with fine-grained tags and descriptions: Three forms of annotation are provided (video frame annotation, subtitle annotation, and integrated annotation of video frames and subtitles), giving multimodal model research and development more complete information. 17 multi-dimensional tags are provided to meet diverse needs.
- Tag composition
<table>
<tr>
<th>Primary Tag</th>
<th>Secondary Tag</th>
</tr>
<tr>
<td rowspan="4">General</td>
<td>Technology & Strategy</td>
</tr>
<tr>
<td>Culture</td>
</tr>
<tr>
<td>Film & Animation</td>
</tr>
<tr>
<td>Travel</td>
</tr>
<tr>
<td rowspan="3">Characters</td>
<td>People</td>
</tr>
<tr>
<td>Animals</td>
</tr>
<tr>
<td>Interview</td>
</tr>
<tr>
<td rowspan="5">Scenarios</td>
<td>Music</td>
</tr>
<tr>
<td>Gaming</td>
</tr>
<tr>
<td>News</td>
</tr>
<tr>
<td>Tutorial</td>
</tr>
<tr>
<td>Sports</td>
</tr>
<tr>
<td>Other</td>
<td>Other</td>
</tr>
</table>
#### Specialized Instruction Fine-tuning (SFT) Data
- Cultural Adversarial Samples: Contains culture-related question-answer pairs designed by local residents to detect model cultural bias.
- Hybrid Quality Inspection Process: Rules + model scoring to screen translation data, reducing noise in low-resource languages.
- Provides non-English cultural corpora (such as local life and traditional customs) to mitigate stereotypes that arise from English-dominated data.
- 5 major tags: Culture, Code, Local Life, AI4S, Mathematics.
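The hybrid quality inspection described above (rules plus model scoring) can be pictured as a two-stage filter: cheap rule checks first, then a model score for the pairs that survive. The sketch below is an illustration only. The specific rules, the thresholds, and the `score_fn` callback are hypothetical stand-ins, since the card does not describe the actual rules or scoring model.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source text, translated text)


def passes_rules(source: str, translation: str,
                 max_len_ratio: float = 3.0) -> bool:
    """Assumed rule checks: non-empty, actually translated, plausible length."""
    src, tgt = source.strip(), translation.strip()
    if not src or not tgt:
        return False
    if src == tgt:  # likely an untranslated copy
        return False
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_len_ratio


def filter_pairs(pairs: Iterable[Pair],
                 score_fn: Callable[[str, str], float],
                 min_score: float = 0.5) -> List[Pair]:
    """Keep pairs that pass the rules and reach the minimum model score."""
    kept = []
    for source, translation in pairs:
        if not passes_rules(source, translation):
            continue
        if score_fn(source, translation) >= min_score:
            kept.append((source, translation))
    return kept


def stub_score(source: str, translation: str) -> float:
    """Placeholder for a real quality model; returns fixed scores for the demo."""
    return 0.9 if "thế giới" in translation else 0.2


sample = [
    ("Hello world", "Xin chào thế giới"),
    ("Hello", ""),
    ("Hi", "Hi"),
    ("Good morning", "Chào buổi sáng"),
]
print(filter_pairs(sample, stub_score))
```

Running the rules before the model keeps the expensive scoring step off pairs that are obviously unusable.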
## License
"Wanjuan·Silk Road 2.0" is licensed under the CC BY 4.0 license overall. You can freely share and adapt the dataset, subject to the following conditions:
- Attribution: You must give appropriate credit to the author, provide a link to the license, and indicate whether changes were made to the original dataset. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No Additional Restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
For the full license text, please visit the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license page.
## Special Notes
Please note that some subsets of this dataset may be subject to other license terms. Before using a specific subset, please read the relevant license carefully to ensure compliant use. More detailed license information can be found in the documentation or metadata of each subset.
OpenDataLab is a non-profit organization that advocates a harmonious and friendly open-source environment. If you find content in the open-source dataset that infringes your legitimate rights and interests, you can send an email to [OpenDataLab@pjlab.org.cn](mailto:OpenDataLab@pjlab.org.cn). In the email, please describe the facts of the infringement in detail and provide relevant proof of ownership. We will begin an investigation within 3 working days and take the necessary measures (such as removing the relevant data). You must ensure that your complaint is truthful; otherwise, you alone bear any adverse consequences of the measures taken.
## Citation
When using Wanjuan·Silk Road 2.0, please add the following citation:
```
@misc{he2024opendatalabempoweringgeneralartificial,
title={OpenDataLab: Empowering General Artificial Intelligence with Open Datasets},
author={Conghui He and Wei Li and Zhenjiang Jin and Chao Xu and Bin Wang and Dahua Lin},
year={2024},
eprint={2407.13773},
archivePrefix={arXiv},
primaryClass={cs.DL},
url={https://arxiv.org/abs/2407.13773},
}
```
Provider:
maas
Created:
2025-04-01
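Aggregated summary
Dataset introduction

Background and challenges
Background overview
Wanjuan·Silk Road 2.0 is a multimodal, multilingual corpus covering 8 languages (this page mainly provides resources for 5 of them: Arabic, Russian, Korean, Vietnamese, and Thai). It contains image-text, audio-text, video-text, and instruction fine-tuning (SFT) data, with more than 11.5 million entries in total and more than 26,000 hours of audio and video. The data is annotated at fine granularity by machines and experts, reaches industrial-grade quality, suits research in scenarios such as cultural tourism and commercial trade, and supports global multilingual applications.
The above content was collected and summarized by 遇见数据集.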



