WanJuanSiLu-Multimodal-5Languages

Source: ModelScope community (updated 2025-12-04, indexed 2025-12-06)

Download link: https://modelscope.cn/datasets/OpenDataLab/WanJuanSiLu-Multimodal-5Languages
# WanJuan·SiLu Multimodal Multilingual Corpus

## 🌏Dataset Introduction

The newly upgraded "Wanjuan·Silk Road Multimodal Corpus" brings three core improvements:

- **Significantly more languages:** The original "Wanjuan·Silk Road" release covered five open-source languages: Arabic, Russian, Korean, Vietnamese, and Thai. "Wanjuan·Silk Road Multimodal" adds corpus data for three scarce languages, Serbian, Hungarian, and Czech, bringing the total to eight key languages in support of global multilingual applications.
- **Fully upgraded data modalities:** Unlike the text-only first version of "Wanjuan·Silk Road", "Wanjuan·Silk Road Multimodal" provides four kinds of data for all eight languages: image-text, audio-text, video-text, and featured instruction fine-tuning (SFT) data, covering the full pipeline of multimodal research. The total data volume exceeds 11.5 million items, and the combined audio and video duration exceeds 26,000 hours.
- **Finely annotated data for multiple scenarios:** The data passes through mature production pipelines and security hardening, combined with machine annotation and fine-grained manual labeling and quality inspection by local experts. "Wanjuan·Silk Road Multimodal" reaches industrial-grade data quality standards and includes more than 20 kinds of fine-grained, multi-dimensional classification labels along with detailed text descriptions. It suits scenarios such as cultural tourism, commerce and trade, and science and technology education, and can be used out of the box.
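As a quick sanity check, the headline figures in this card can be recomputed from the per-language table in the "Open source data details" section. The sketch below hard-codes those table values; the dictionary layout and the `column_totals` helper are illustrative names chosen here, not part of the dataset or any official tooling.

```python
# Per-language figures copied from the "Open source data details" table.
# Tuple order: images, audio hours, video hours, SFT samples.
TABLE = {
    "Arabic":     (220_000, 200, 1738, 23_000),
    "Russian":    (250_000, 212, 3491, 23_000),
    "Korean":     (530_000, 202, 3412, 23_000),
    "Vietnamese": (450_000, 205, 2901, 23_000),
    "Thai":       (100_000, 201, 5684, 23_000),
    "Serbian":    (80_000, 206, 2578, 23_000),
    "Hungarian":  (220_000, 208, 3470, 23_000),
    "Czech":      (270_000, 202, 2453, 23_000),
}
COLUMNS = ("images", "audio_hours", "video_hours", "sft_samples")

# The five languages this repository provides directly.
FIVE_LANGUAGES = ("Arabic", "Russian", "Korean", "Vietnamese", "Thai")


def column_totals(languages):
    """Sum each table column over the given language names."""
    rows = [TABLE[name] for name in languages]
    return {col: sum(row[i] for row in rows) for i, col in enumerate(COLUMNS)}


all_totals = column_totals(TABLE)            # all eight languages
five_totals = column_totals(FIVE_LANGUAGES)  # this repository's subset

print("eight languages:", all_totals)
print("five languages: ", five_totals)
```

By these figures, the eight-language totals come to 2,120,000 images, 1,636 audio hours, 25,727 video hours, and 184,000 SFT samples, which is consistent with the rounded numbers in the "Open source content" list.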
## 🚩Open source content

- More than 2 million image-text items have been open sourced;
- more than 1,600 hours of audio-text data have been open sourced;
- more than 25,000 hours of video-text data have been open sourced;
- 180,000 SFT items have been open sourced.

## 📚Open source data details:

|Language|Image-text module (number of images)|Audio module duration (hours)|Video module duration (hours)|SFT module data volume|
|---|---|---|---|---|
|Arabic|220,000|200|1,738|23,000|
|Russian|250,000|212|3,491|23,000|
|Korean|530,000|202|3,412|23,000|
|Vietnamese|450,000|205|2,901|23,000|
|Thai|100,000|201|5,684|23,000|
|Serbian|80,000|206|2,578|23,000|
|Hungarian|220,000|208|3,470|23,000|
|Czech|270,000|202|2,453|23,000|

### ⚠⚠⚠ This repository mainly provides multimodal resources in 5 languages (Arabic, Russian, Korean, Vietnamese, Thai). You can download and use them directly after logging in (no application is required)

- For multimodal resources in the other 3 languages (Serbian, Hungarian, Czech), please visit this page ([https://opendatalab.com/OpenDataLab/WanJuanSiLu2](https://opendatalab.com/OpenDataLab/WanJuanSiLu2)) and click to apply; you can download and use them once the author approves.
- The first batch of open-source plain-text corpora in five languages is available on the following pages and can be downloaded directly after logging in:
  - WanJuan-Thai: [https://opendatalab.com/OpenDataLab/WanJuan-Thai](https://opendatalab.com/OpenDataLab/WanJuan-Thai)
  - WanJuan-Russian: [https://opendatalab.com/OpenDataLab/WanJuan-Russian](https://opendatalab.com/OpenDataLab/WanJuan-Russian)
  - WanJuan-Korean: [https://opendatalab.com/OpenDataLab/WanJuan-Korean](https://opendatalab.com/OpenDataLab/WanJuan-Korean)
  - WanJuan-Vietnamese: [https://opendatalab.com/OpenDataLab/WanJuan-Vietnamese](https://opendatalab.com/OpenDataLab/WanJuan-Vietnamese)
  - WanJuan-Arabic: [https://opendatalab.com/OpenDataLab/WanJuan-Arabic](https://opendatalab.com/OpenDataLab/WanJuan-Arabic)

## Data processing features:

#### 📸Image-text data:

- Balanced coverage of multiple fields: high-quality image-text data from Wikipedia, Wikiquote, encyclopedias, and mainstream news media in the countries of the eight languages;
- Dual annotation: a basic alt-text description plus an extended description generated by a vision model, to improve information richness;
- Evenly distributed across 10 high-interest areas to avoid data skew. Label composition: outdoor scenes, indoor scenes, urban scenes, rural scenes, text technology, natural scenery, folk traditions, adults, food.

#### 🎥Audio-text data:

- Dual ASR verification for high annotation quality: the audio-text data is transcribed from mainstream streaming video platforms and cross-validated by two commercial ASR engines, Google and Microsoft, to ensure high-precision text annotation. Environmental noise reduction is applied to improve sound quality;
- Real-scene speech: natural conversation data that contains environmental noise and is close to actual applications. Compared with similar datasets, this dataset has clear advantages in multilingual coverage, conversational authenticity, and annotation quality;
- 4 major data categories: social humanities, entertainment media, knowledge education, life culture.

#### 📞Video-text data:

- Rich language coverage that fills data gaps: the total amount of video in 8 languages (including Hungarian, Serbian, etc.) exceeds 16,000 hours. Compared with similar datasets, this dataset includes many low-resource languages, filling the gaps for these languages in video datasets, and is a valuable resource for multimodal research and low-resource language processing;
- Multimodal annotation system with fine-grained labels and descriptions: it provides three forms of annotation at the same time (video frame annotation, subtitle annotation, and integrated frame-and-subtitle annotation), giving more comprehensive information support for the research and development of multimodal models. It provides 17 types of multidimensional labels to meet diverse needs;
- Label composition:

<table>
  <tr> <th>First-level label</th> <th>Second-level label</th> </tr>
  <tr> <td rowspan="4">General</td> <td>Technology and Strategy</td> </tr>
  <tr> <td>Culture</td> </tr>
  <tr> <td>Movies and Animation</td> </tr>
  <tr> <td>Travel</td> </tr>
  <tr> <td rowspan="3">Characters</td> <td>Characters</td> </tr>
  <tr> <td>Animal</td> </tr>
  <tr> <td>Interviews</td> </tr>
  <tr> <td rowspan="5">Scenes</td> <td>Music</td> </tr>
  <tr> <td>Games</td> </tr>
  <tr> <td>News</td> </tr>
  <tr> <td>Tutorials</td> </tr>
  <tr> <td>Sports</td> </tr>
  <tr> <td>Others</td> <td>Others</td> </tr>
</table>

#### 🤖Featured instruction fine-tuning (SFT) data:

- Cultural adversarial samples: culturally relevant question-answer pairs designed by local residents to detect cultural bias in models;
- Hybrid quality inspection process: rules plus model scoring are used to filter translated data and reduce noise in low-resource languages;
- Non-English cultural corpus (such as local life and traditional customs) to alleviate stereotypes that arise when English data dominates;
- Five major tags: culture, code, local life, AI4S, mathematics.

## License

The WanJuan·SiLu Multimodal dataset as a whole is released under the CC BY 4.0 license. You are free to share and adapt this dataset, provided you follow these conditions:

- Attribution: You must give appropriate credit to the author, provide a link to the license, and indicate whether the original dataset has been modified. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions: You may not apply legal terms or technological measures that restrict others from doing anything the license permits.

For the complete terms, please refer to the full text of the CC BY 4.0 license.

## Special Notes

Please note that some subsets of this dataset may be subject to other license provisions. Before using a specific subset, be sure to read the relevant license carefully to ensure compliance. For more detailed license information, check the documents or metadata of the specific subset.

As a non-profit organization, OpenDataLab advocates a harmonious and friendly open-source communication environment. If you find any content in the open-source dataset that infringes your legal rights, you can send an email to OpenDataLab@pjlab.org.cn. Please describe the facts of the infringement in detail and provide relevant proof of ownership. We will start the investigation and handling process within 3 working days and take the necessary measures (such as taking down the relevant data). You must ensure that your complaint is truthful; otherwise, you alone bear any adverse consequences of the measures taken.
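The dual-ASR cross-validation mentioned under "Audio-text data" can be pictured with a small sketch. The card names the two engines (Google and Microsoft) but does not say how agreement between them is measured, so everything below is an assumption for illustration: the word-level similarity measure, the `word_agreement` and `keep_segment` helpers, and the 0.9 threshold are hypothetical and not the authors' pipeline.

```python
from difflib import SequenceMatcher


def word_agreement(transcript_a: str, transcript_b: str) -> float:
    """Return a similarity ratio in [0, 1] between two transcripts, compared word by word."""
    words_a = transcript_a.lower().split()
    words_b = transcript_b.lower().split()
    # SequenceMatcher works on any sequence of hashable items, here lists of words.
    return SequenceMatcher(None, words_a, words_b).ratio()


def keep_segment(transcript_a: str, transcript_b: str, threshold: float = 0.9) -> bool:
    """Keep an audio segment only when both engines largely agree (threshold is hypothetical)."""
    return word_agreement(transcript_a, transcript_b) >= threshold


# Hypothetical outputs of two ASR engines for the same audio segment.
engine_a = "the market opens at nine"
engine_b = "the market opened at night"

print(word_agreement(engine_a, engine_a))  # identical transcripts
print(word_agreement(engine_a, engine_b))  # partial agreement
print(keep_segment(engine_a, engine_b))
```

A production pipeline would more likely compute a word or character error rate with language-specific tokenization, which matters for Thai, where words are not separated by spaces.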
## Citations

If you use Wanjuan·Silk Road Multimodal, please add the following citation:

```
@misc{he2024opendatalabempoweringgeneralartificial,
  title={OpenDataLab: Empowering General Artificial Intelligence with Open Datasets},
  author={Conghui He and Wei Li and Zhenjiang Jin and Chao Xu and Bin Wang and Dahua Lin},
  year={2024},
  eprint={2407.13773},
  archivePrefix={arXiv},
  primaryClass={cs.DL},
  url={https://arxiv.org/abs/2407.13773},
}
```

Provider: maas

Created: 2025-11-26