WanJuanSiLu-Multimodal-3Languages
Source: ModelScope community. Updated 2026-01-06; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/OpenDataLab/WanJuanSiLu-Multimodal-3Languages
Resource description:
---
license: cc-by-4.0
---
# WanJuan·SiLu Multimodal Multilingual Corpus
## 🌏Dataset Introduction
The newly upgraded WanJuan·SiLu (Silk Road) Multimodal Corpus brings three core improvements:
- **Significantly expanded language coverage:** Building on the five languages open-sourced in the original WanJuan·SiLu release (Arabic, Russian, Korean, Vietnamese, and Thai), WanJuan·SiLu Multimodal adds corpora for three scarce-resource languages (Serbian, Hungarian, and Czech), covering eight key languages in total to support multilingual applications worldwide.
- **Fully upgraded data modalities:** Unlike the text-only first version of WanJuan·SiLu, WanJuan·SiLu Multimodal provides four types of data for all eight languages: image-text, audio-text, video-text, and specialized instruction fine-tuning (SFT) data, covering the full pipeline of multimodal research. The total data volume exceeds 11.5 million entries, and the combined audio and video duration exceeds 26,000 hours, enough to support a wide range of research tasks.
- **Fine-grained data for multiple scenarios:** Produced with mature data pipelines and security hardening, and refined through fine-grained labeling and quality inspection by both machines and local experts, WanJuan·SiLu Multimodal reaches industrial-grade data quality. It includes more than 20 kinds of fine-grained, multi-dimensional classification labels and detailed text descriptions, suits scenarios such as cultural tourism, commerce and trade, and science and technology education, and can be used out of the box so that developers can focus on value creation.
## 🚩Open source content
- More than 2 million image-text pairs have been open-sourced;
- More than 1,600 hours of audio-text data have been open-sourced;
- More than 25,000 hours of video-text data have been open-sourced;
- 180,000 SFT samples have been open-sourced.
## 📚Open source data details:
|Language|Image-text module (number of images)|Audio module duration (hours)|Video module duration (hours)|SFT module (number of samples)|
|---|---|---|---|---|
|Arabic|220,000|200|1738|23,000|
|Russian|250,000|212|3491|23,000|
|Korean|530,000|202|3412|23,000|
|Vietnamese|450,000|205|2901|23,000|
|Thai|100,000|201|5684|23,000|
|Serbian|80,000|206|2578|23,000|
|Hungarian|220,000|208|3470|23,000|
|Czech|270,000|202|2453|23,000|
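
The headline figures in the "Open source content" list can be cross-checked against the table by summing each column. The short sketch below does only that arithmetic; the dictionary layout and the function name `column_totals` are illustrative choices, not part of the dataset tooling.

```python
# Per-language figures copied from the table above:
# (images, audio hours, video hours, SFT samples)
TABLE = {
    "Arabic":     (220_000, 200, 1738, 23_000),
    "Russian":    (250_000, 212, 3491, 23_000),
    "Korean":     (530_000, 202, 3412, 23_000),
    "Vietnamese": (450_000, 205, 2901, 23_000),
    "Thai":       (100_000, 201, 5684, 23_000),
    "Serbian":    (80_000, 206, 2578, 23_000),
    "Hungarian":  (220_000, 208, 3470, 23_000),
    "Czech":      (270_000, 202, 2453, 23_000),
}


def column_totals(table):
    """Sum each column of the table across all languages."""
    return tuple(sum(col) for col in zip(*table.values()))


images, audio_h, video_h, sft = column_totals(TABLE)
print(images, audio_h, video_h, sft)
# 2120000 1636 25727 184000
```

The totals (about 2.12 million images, 1,636 audio hours, 25,727 video hours, and 184,000 SFT samples) agree with the rounded figures quoted in the list.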
### ⚠⚠⚠ This repository contains multimodal data resources for 3 languages (Serbian, Hungarian, and Czech). To access them, click the "Apply" button on the dataset file page; the data becomes available for download after author approval.
- For the other 5 languages (Arabic, Russian, Korean, Vietnamese, Thai), visit this link to download directly (no application required):
[https://opendatalab.com/OpenDataLab/WanJuanSiLu2O](https://opendatalab.com/OpenDataLab/WanJuanSiLu2O)
- The first batch of open-source plain-text corpora in five languages is available on the following pages and can be downloaded directly after logging in:
  - WanJuan-Thai: [https://opendatalab.com/OpenDataLab/WanJuan-Thai](https://opendatalab.com/OpenDataLab/WanJuan-Thai)
  - WanJuan-Russian: [https://opendatalab.com/OpenDataLab/WanJuan-Russian](https://opendatalab.com/OpenDataLab/WanJuan-Russian)
  - WanJuan-Korean: [https://opendatalab.com/OpenDataLab/WanJuan-Korean](https://opendatalab.com/OpenDataLab/WanJuan-Korean)
  - WanJuan-Vietnamese: [https://opendatalab.com/OpenDataLab/WanJuan-Vietnamese](https://opendatalab.com/OpenDataLab/WanJuan-Vietnamese)
  - WanJuan-Arabic: [https://opendatalab.com/OpenDataLab/WanJuan-Arabic](https://opendatalab.com/OpenDataLab/WanJuan-Arabic)
## Data processing features:
#### 📸Image-text data:
- Balanced coverage of multiple domains: high-quality image-text data from Wikipedia, Wikiquote, encyclopedias, and mainstream news media in the countries of the eight languages;
- Dual annotation: a basic alt-text description plus an extended description generated by a vision model, to increase information richness;
- Evenly distributed across 10 high-interest areas to avoid data skew. Labels include: outdoor scenes, indoor scenes, urban scenes, rural scenes, text technology, natural scenery, folk traditions, adults, food;
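
One way a user might check the "no data skew" claim on a downloaded subset is to count labels and compare the most and least frequent ones. The sketch below is a minimal illustration only: it assumes the labels have already been extracted into a plain list of strings, and the function name `label_skew` is hypothetical rather than part of any released tooling.

```python
from collections import Counter


def label_skew(labels):
    """Return (counts, ratio), where ratio = largest class size / smallest class size.

    A ratio close to 1.0 means the labels are evenly distributed.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels given")
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio


# Toy input using label names from the list above (not real dataset records).
sample = ["food", "urban scenes", "natural scenery",
          "food", "urban scenes", "natural scenery"]
counts, ratio = label_skew(sample)
print(counts.most_common(), ratio)
```

Note that a label missing from the sample entirely does not show up in the ratio, so a full check would also compare the observed label set against the expected one.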
#### 🎥Audio-text data:
- Dual ASR verification for high quality: the audio-text data is transcribed from mainstream video streaming platforms and cross-validated with two commercial ASR engines (Google and Microsoft) to ensure high-precision text annotation, combined with environmental noise reduction to improve sound quality;
- Real-scene speech: natural conversation data containing environmental noise, close to real applications. Compared with similar datasets, this dataset has clear advantages in multilingual coverage, conversational authenticity, and annotation quality;
- Four major data categories: social humanities, entertainment media, knowledge education, life culture;
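
The card does not describe how the two ASR outputs are compared, so the following is only a sketch of one common approach to cross-validation: compute a word-level edit distance between the two transcripts and keep a segment only when they largely agree. The function names and the `max_disagreement` threshold are assumptions made for illustration.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def disagreement_rate(hyp_a, hyp_b):
    """Word-level edit distance normalised by the longer transcript."""
    a, b = hyp_a.lower().split(), hyp_b.lower().split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return word_edit_distance(a, b) / longest


def keep_segment(hyp_a, hyp_b, max_disagreement=0.1):
    """Keep a segment only if the two transcripts (nearly) agree."""
    return disagreement_rate(hyp_a, hyp_b) <= max_disagreement


print(disagreement_rate("a b c d", "a b x d"))  # 0.25
```

Whitespace tokenization is adequate for Serbian, Hungarian, and Czech, but a language written without spaces between words, such as Thai, would need a proper word segmenter or a character-level comparison instead.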
#### 📞Video-text data:
- Broad language coverage that fills data gaps: videos in 8 languages (including Hungarian and Serbian) total more than 16,000 hours. Compared with similar datasets, this dataset includes many low-resource languages, filling the gaps for these languages in video datasets, and is a valuable resource for multimodal research and low-resource language processing;
- Multimodal annotation system with fine-grained labels and descriptions: three forms of annotation are provided together (video-frame annotation, subtitle annotation, and combined frame-and-subtitle annotation), giving fuller information support for multimodal model development. 17 types of multi-dimensional labels are provided to meet diverse needs;
- Label composition:
<table>
<tr>
<th>First-level label</th>
<th>Second-level label</th>
</tr>
<tr>
<td rowspan="4">General</td>
<td>Technology and Strategy</td>
</tr>
<tr>
<td>Culture</td>
</tr>
<tr>
<td>Movies and Animation</td>
</tr>
<tr>
<td>Travel</td>
</tr>
<tr>
<td rowspan="3">Characters</td>
<td>Characters</td>
</tr>
<tr>
<td>Animal</td>
</tr>
<tr>
<td>Interviews</td>
</tr>
<tr>
<td rowspan="5">Scenes</td>
<td>Music</td>
</tr>
<tr>
<td>Games</td>
</tr>
<tr>
<td>News</td>
</tr>
<tr>
<td>Tutorials</td>
</tr>
<tr>
<td>Sports</td>
</tr>
<tr>
<td>Others</td>
<td>Others</td>
</tr>
</table>
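
For programmatic use, the two-level hierarchy in the table can be written down as a simple mapping. This is a sketch transcribed from the table, not a schema shipped with the dataset; the names `VIDEO_LABELS` and `first_level_of` are hypothetical. Counting the entries gives 4 first-level and 13 second-level labels, which may be how the figure of 17 labels is reached, though the card does not say so explicitly.

```python
# Two-level video label hierarchy, transcribed from the table above.
VIDEO_LABELS = {
    "General": ["Technology and Strategy", "Culture", "Movies and Animation", "Travel"],
    "Characters": ["Characters", "Animal", "Interviews"],
    "Scenes": ["Music", "Games", "News", "Tutorials", "Sports"],
    "Others": ["Others"],
}


def first_level_of(second_level):
    """Look up the first-level label that a second-level label belongs to."""
    for first, seconds in VIDEO_LABELS.items():
        if second_level in seconds:
            return first
    raise KeyError(second_level)


n_first = len(VIDEO_LABELS)
n_second = sum(len(seconds) for seconds in VIDEO_LABELS.values())
print(n_first, n_second, n_first + n_second)
# 4 13 17
```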
#### 🤖Specialized instruction fine-tuning (SFT) data:
- Cultural adversarial samples: culturally relevant question-answer pairs designed by local residents to detect cultural bias in models;
- Hybrid quality inspection process: rules plus model scoring to filter translated data and reduce noise in low-resource languages;
- Non-English cultural corpora (such as local life and traditional customs) to alleviate stereotypes that arise from English-dominated data;
- Five major tags: culture, code, local life, AI4S, mathematics.
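
The "rules plus model scoring" process is not specified further in this card. As a rough sketch of how such a two-stage filter is typically structured, the code below applies cheap rule checks first and calls a scoring function only on samples that pass. The field names `question` and `answer`, the specific rules, the threshold, and `toy_score` are all assumptions for illustration; a real pipeline would plug in an actual quality model.

```python
from typing import Callable, Iterable, List


def passes_rules(sample: dict) -> bool:
    """Cheap rule-based checks applied before any model is called."""
    question = sample.get("question", "").strip()
    answer = sample.get("answer", "").strip()
    if not question or not answer:
        return False  # empty field
    if "\ufffd" in question + answer:
        return False  # Unicode replacement character signals encoding damage
    if question == answer:
        return False  # answer merely echoes the question
    return True


def hybrid_filter(samples: Iterable[dict],
                  score_fn: Callable[[dict], float],
                  min_score: float = 0.5) -> List[dict]:
    """Keep samples that pass the rules and score at least min_score."""
    return [s for s in samples if passes_rules(s) and score_fn(s) >= min_score]


def toy_score(sample: dict) -> float:
    """Stand-in for a quality model: longer answers score higher, capped at 1.0."""
    return min(len(sample["answer"]) / 20, 1.0)


samples = [
    {"question": "Q1", "answer": "A sufficiently long answer."},
    {"question": "Q2", "answer": ""},
    {"question": "Q3", "answer": "ok"},
]
kept = hybrid_filter(samples, toy_score)
print([s["question"] for s in kept])  # ['Q1']
```

Running the rules first keeps the more expensive scoring step off samples that are obviously unusable.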
## License
The WanJuan·SiLu Multimodal dataset as a whole is released under the CC BY 4.0 license. You are free to share and adapt this dataset, provided that you meet the following conditions:
- Attribution: You must give appropriate credit to the author, provide a link to the license, and indicate whether the original dataset has been modified. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions: You may not apply legal terms or technical measures that restrict others from doing anything the license permits. For the complete terms, see the full text of the CC BY 4.0 license.
## Special Notes
Please note that some subsets of this dataset may be subject to other license terms. Before using a specific subset, read the relevant license carefully to ensure compliance. For more detailed license information, check the documentation or metadata of the specific subset.
As a non-profit organization, OpenDataLab advocates a harmonious and friendly open-source community. If you find content in the open-source dataset that infringes your legal rights, you can send an email to OpenDataLab@pjlab.org.cn. Please describe the facts of the infringement in detail and provide relevant proof of ownership. We will start the investigation and handling process within 3 working days and take the necessary measures (such as delisting the relevant data). You are responsible for the truthfulness of your complaint; any adverse consequences of measures taken on the basis of an untrue complaint shall be borne by you alone.
## Citations
If you use WanJuan·SiLu Multimodal, please add the following citation:
```
@misc{he2024opendatalabempoweringgeneralartificial,
title={OpenDataLab: Empowering General Artificial Intelligence with Open Datasets},
author={Conghui He and Wei Li and Zhenjiang Jin and Chao Xu and Bin Wang and Dahua Lin},
year={2024},
eprint={2407.13773},
archivePrefix={arXiv},
primaryClass={cs.DL},
url={https://arxiv.org/abs/2407.13773},
}
```
Provider: maas
Created: 2025-11-26



