Wanjuan Silk Road - Multimodal: 3 Languages (Serbian, Hungarian, Czech)
OpenDataLab | Updated: 2026-06-14 | Indexed: 2025-03-22
Download link:
https://opendatalab.org.cn/OpenDataLab/WanJuanSiLu2
Resource description:
The newly upgraded "Wanjuan·Silk Road 2.0" delivers three core enhancements:
First, more languages and fully upgraded modalities: for all 8 languages it provides four categories of data, namely image-text, audio-text, video-text, and specialized instruction fine-tuning (SFT) data, covering the full multimodal research workflow.
Second, a larger overall scale: the dataset contains more than 11.5 million entries and more than 26,000 hours of audio and video, enough to support a wide range of research tasks.
Third, ultra-fine-grained data suited to multiple scenarios: the data is produced through a mature data production pipeline with security hardening, and is annotated and quality-checked by combining machine processing with fine-grained manual work by local experts, so that "Wanjuan·Silk Road 2.0" meets industrial-grade data quality standards. It includes more than 20 types of fine-grained, multi-dimensional classification tags along with detailed text descriptions, and fits scenarios such as cultural tourism, commercial trade, and science and education. It is usable out of the box, which reduces developers' workload and lets them focus on value creation.
Provider:
OpenDataLab
Created:
2025-03-20
Collected summary
Dataset introduction

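Background and challenges
Background overview
This dataset is an upgraded version of the "Wanjuan·Silk Road multimodal multilingual corpus". It adds three low-resource languages (Serbian, Hungarian, and Czech), bringing coverage to 8 key languages, and provides four categories of data: image-text, audio-text, video-text, and instruction fine-tuning (SFT). It contains more than 11.5 million entries and more than 26,000 hours of audio and video, meets industrial-grade quality standards, and is suited to research across scenarios such as cultural tourism and commercial trade.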
The content above was collected and summarized by 遇见数据集.



