
WanJuan·Silk Road 2.0 (3 languages: Serbian, Hungarian, Czech)

ModelScope community · Updated 2025-12-09 · Indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/OpenDataLab/WanJuanSiLu2
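For readers who prefer scripted access, the following is a minimal sketch of loading the repository through the ModelScope Python SDK. It assumes that `modelscope` is installed, that you are logged in, and that your access request for this repository has already been approved (the README states that downloads require the author's approval). The helper names `dataset_id_from_url` and `load_dataset` are illustrative, and the default subset and split arguments may need adjusting to the actual repository layout.

```python
from urllib.parse import urlparse


def dataset_id_from_url(url: str) -> str:
    """Extract '<namespace>/<name>' from a ModelScope dataset page URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path shape: /datasets/<namespace>/<name>
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a ModelScope dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def load_dataset(url: str):
    """Load the dataset via the ModelScope SDK (hypothetical helper)."""
    # Imported lazily so the URL helper above works without the SDK installed.
    from modelscope.msdatasets import MsDataset

    namespace, name = dataset_id_from_url(url).split("/")
    # Assumption: access has been granted and credentials are configured;
    # subset/split arguments are left at their defaults here.
    return MsDataset.load(name, namespace=namespace)


URL = "https://modelscope.cn/datasets/OpenDataLab/WanJuanSiLu2"
print(dataset_id_from_url(URL))  # prints OpenDataLab/WanJuanSiLu2
```

Calling `load_dataset(URL)` performs the actual download; it is kept separate from the URL parsing so that the identifier can be checked without network access.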
Resource overview:
# WanJuan·Silk Road 2.0 Multimodal Multilingual Corpus

## Dataset Introduction

The newly upgraded WanJuan·Silk Road 2.0 brings three core improvements:

- **Expanded Language Coverage**: Building on the 5 languages (Arabic, Russian, Korean, Vietnamese, Thai) open-sourced in WanJuan·Silk Road 1.0, WanJuan·Silk Road 2.0 adds corpora for 3 low-resource languages (Serbian, Hungarian and Czech), for a total of 8 key languages supporting global multilingual applications.
- **Comprehensive Modal Upgrade**: Unlike the text-only data of WanJuan·Silk Road 1.0, WanJuan·Silk Road 2.0 provides four types of data for all 8 languages: image-text, audio-text, video-text, and specialized supervised fine-tuning (SFT) data, covering the full pipeline of multimodal research. The dataset contains more than 11.5 million items in total, and more than 26,000 hours of audio and video.
- **Ultra-fine Data for Multi-scenario Applications**: The data passes through a mature production pipeline and security hardening, combined with fine-grained annotation and quality inspection by both machines and local human experts, so that WanJuan·Silk Road 2.0 meets industrial-grade data quality standards. It includes more than 20 fine-grained, multi-dimensional classification tags and detailed text descriptions, suits scenarios such as cultural tourism, commercial trade, and technology and education, and is ready to use out of the box, reducing developers' workload so they can focus on value creation.

## Open-Sourced Content

- Image-text: more than 2,000,000 items open-sourced in total;
- Audio-text: more than 1,600 hours open-sourced;
- Video-text: more than 25,000 hours open-sourced;
- SFT: 180,000 items open-sourced.

Detailed statistics of the open-sourced data:

| Language | Image-Text Module (Number of Images) | Audio Module Duration (Hours) | Video Module Duration (Hours) | SFT Module (Number of Items) |
|---|---|---|---|---|
| Arabic | 220,000 | 200 | 1,738 | 23,000 |
| Russian | 250,000 | 212 | 3,491 | 23,000 |
| Korean | 530,000 | 202 | 3,412 | 23,000 |
| Vietnamese | 450,000 | 205 | 2,901 | 23,000 |
| Thai | 100,000 | 201 | 5,684 | 23,000 |
| Serbian | 80,000 | 206 | 2,578 | 23,000 |
| Hungarian | 220,000 | 208 | 3,470 | 23,000 |
| Czech | 270,000 | 202 | 2,453 | 23,000 |

**[Note]**

- This repository mainly hosts the data for 3 languages: Serbian, Hungarian and Czech. Click the application button on the dataset file page; once the author approves the request, the data can be downloaded and used.
- Resources for the other 5 languages (Arabic, Russian, Korean, Vietnamese, Thai) can be downloaded directly, without an application, from this link: https://opendatalab.com/OpenDataLab/WanJuanSiLu2O

## Data Processing Characteristics

#### Image-Text Data:

- Balanced Coverage Across Multiple Domains: High-quality image-text data sourced from Wikipedia, Wikiquote, encyclopedias and mainstream media news of the countries where the 8 languages are spoken;
- Innovative Dual Annotation: Basic alt-text description plus an extended description generated by visual models, improving information richness;
- Evenly distributed across 10 high-concern domains to avoid data skew. The tag categories include: outdoor scenes, indoor scenes, urban scenes, rural scenes, text & technology, natural scenery, folk traditions, adults, food;

#### Audio-Text Data:

- Dual ASR Verification for Ultra-high Quality: The audio-text data is collected and transcribed from mainstream video media platforms and cross-validated with two commercial ASR engines (Google and Microsoft) to ensure highly accurate text annotations. Environmental noise reduction is also applied to improve audio quality;
- Real-scenario Speech: Contains natural dialogue data with environmental noise, close to practical applications. Compared with other similar datasets, this dataset has clear advantages in multilingual coverage, dialogue authenticity and annotation quality;
- 4 major data categories: Social Humanities, Entertainment Media, Academic Education, Life and Culture;

#### Video-Text Data:

- Rich Language Categories, Filling Data Gaps: The video data for the 8 languages (including Hungarian, Serbian and others) totals more than 16,000 hours. Compared with similar datasets, this dataset includes many low-resource languages, filling the gap for these languages in video datasets, and is a valuable resource for multimodal research and low-resource language processing;
- Multimodal Annotation System with Fine-grained Tags and Descriptions: Three forms of annotation are provided (video frame annotation, subtitle annotation, and integrated annotation of video frames and subtitles), giving more comprehensive information support for the research and development of multimodal models. 17 types of multi-dimensional tags are provided to meet diverse needs;
- Tag Composition

<table>
  <tr> <th>First-level Tag</th> <th>Second-level Tag</th> </tr>
  <tr> <td rowspan="4">General</td> <td>Technology and Strategy</td> </tr>
  <tr> <td>Culture</td> </tr>
  <tr> <td>Film and Animation</td> </tr>
  <tr> <td>Travel</td> </tr>
  <tr> <td rowspan="3">Person</td> <td>Person</td> </tr>
  <tr> <td>Animal</td> </tr>
  <tr> <td>Interview</td> </tr>
  <tr> <td rowspan="5">Scene</td> <td>Music</td> </tr>
  <tr> <td>Game</td> </tr>
  <tr> <td>News</td> </tr>
  <tr> <td>Tutorial</td> </tr>
  <tr> <td>Sports</td> </tr>
  <tr> <td>Other</td> <td>Other</td> </tr>
</table>

#### Specialized Supervised Fine-tuning (SFT) Data:

- Cultural Adversarial Samples: Culture-related question-answer pairs designed by local residents, used to detect cultural bias in models
- Hybrid Quality Inspection Process: Rules plus model scoring are used to filter translated data, reducing noise in low-resource languages
- Non-English cultural corpora (such as local life and traditional customs) are provided to mitigate stereotypes that arise when English data dominates
- 5 major tag categories: Culture, Code, Local Life, AI4S, Mathematics

## License

WanJuan·Silk Road 2.0 as a whole is licensed under CC BY 4.0. You are free to share and adapt this dataset, subject to the following conditions:

- Attribution: You must give appropriate credit to the author, provide a link to the license, and indicate whether changes were made to the original dataset. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No Additional Restrictions: You may not apply legal terms or technological measures that restrict others from doing anything the license permits.

For the full license text, please visit [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

## Special Notes

Please note that some subsets of this dataset may be subject to other license terms. Before using a specific subset, please read the relevant terms carefully to ensure compliant use. More detailed license information can be found in the documentation or metadata of the specific subset.

OpenDataLab, as a non-profit organization, advocates a harmonious and friendly open-source communication environment. If you find content in the open-source dataset that infringes your legitimate rights and interests, please send an email to [OpenDataLab@pjlab.org.cn](mailto:OpenDataLab@pjlab.org.cn) with a detailed description of the facts of the infringement and the relevant proof of ownership. We will start the investigation and handling process within 3 working days and take the necessary measures (such as removing the relevant data). You must ensure that your complaint is truthful; otherwise, you bear sole responsibility for any adverse consequences of the measures taken.

## Citation

When using WanJuan·Silk Road 2.0, please add the following citation:

```
@misc{he2024opendatalabempoweringgeneralartificial,
      title={OpenDataLab: Empowering General Artificial Intelligence with Open Datasets},
      author={Conghui He and Wei Li and Zhenjiang Jin and Chao Xu and Bin Wang and Dahua Lin},
      year={2024},
      eprint={2407.13773},
      archivePrefix={arXiv},
      primaryClass={cs.DL},
      url={https://arxiv.org/abs/2407.13773},
}
```
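The per-language figures in the Open-Sourced Content table can be tallied to cross-check the headline totals. The sketch below simply sums each column; the numbers are copied from the table, while the dictionary layout and the function name `column_totals` are illustrative choices rather than part of the dataset.

```python
# Per-language figures copied from the Open-Sourced Content table:
# (image-text items, audio hours, video hours, SFT items)
STATS = {
    "Arabic":     (220_000, 200, 1_738, 23_000),
    "Russian":    (250_000, 212, 3_491, 23_000),
    "Korean":     (530_000, 202, 3_412, 23_000),
    "Vietnamese": (450_000, 205, 2_901, 23_000),
    "Thai":       (100_000, 201, 5_684, 23_000),
    "Serbian":    (80_000, 206, 2_578, 23_000),
    "Hungarian":  (220_000, 208, 3_470, 23_000),
    "Czech":      (270_000, 202, 2_453, 23_000),
}

COLUMNS = ("image_text_items", "audio_hours", "video_hours", "sft_items")


def column_totals(stats):
    """Sum each column of the per-language table."""
    return {
        name: sum(row[i] for row in stats.values())
        for i, name in enumerate(COLUMNS)
    }


totals = column_totals(STATS)
for name, value in totals.items():
    print(f"{name}: {value:,}")
```

The sums come to 2,120,000 image-text items, 1,636 audio hours, 25,727 video hours and 184,000 SFT items. These match the stated "more than 2,000,000 items", "more than 1,600 hours" and "more than 25,000 hours"; the SFT headline of 180,000 appears to be a rounded figure.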
Provider:
maas
Created:
2025-04-01
Collected summary
Dataset introduction
Background and challenges
Background overview
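WanJuan·Silk Road 2.0 is a multimodal, multilingual corpus. Building on version 1.0, it adds Serbian, Hungarian and Czech, covering 8 key languages in total, and provides four types of data: image-text, audio-text, video-text and instruction fine-tuning (SFT). It contains more than 11.5 million items and more than 26,000 hours of audio and video. The dataset meets industrial-grade quality standards, includes fine-grained classification tags, suits scenarios such as cultural tourism and commercial trade, and fills the gap for low-resource languages in multimodal datasets.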
The content above was collected and summarized by 遇见数据集.