WanJuan-SiLu Multimodal: 5 languages (Arabic, Russian, Korean, Vietnamese, Thai)
OpenDataLab · Updated 2026-06-14 · Indexed 2025-03-22
Download link:
https://opendatalab.org.cn/OpenDataLab/WanJuanSiLu2O
Resource overview:
The newly upgraded "WanJuan·SiLu 2.0" (Wanjuan·Silk Road 2.0) brings three core improvements:
More languages and upgraded data modalities: for each of its 8 languages, the dataset provides four kinds of data (image-text, audio-text, video-text, and specialized instruction fine-tuning (SFT) data), covering the full pipeline of multimodal research. The dataset contains more than 11.5 million entries in total, with more than 26,000 hours of audio and video, supporting a wide range of research tasks.
Fine-grained data for multiple scenarios: the data passes through a mature production pipeline and safety hardening, followed by fine-grained annotation and quality inspection that combine machine processing with manual review by local experts. "WanJuan·SiLu 2.0" meets industrial-grade data quality standards and includes more than 20 types of fine-grained, multi-dimensional classification tags along with detailed text descriptions. It suits scenarios such as cultural tourism, commercial trade, and science and technology education, and is ready to use out of the box, reducing developers' workload so they can focus on value creation.
Provider:
OpenDataLab
Created:
2025-03-20
Collected summary
Dataset introduction

Background and challenges
Background overview
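The WanJuan-SiLu Multimodal dataset is a multilingual corpus covering Arabic, Russian, Korean, Vietnamese, and Thai. It provides image-text, audio-text, video-text, and SFT multimodal data, totaling more than 2 million image-text entries and several thousand hours of audio and video content. The dataset is finely annotated with multi-dimensional classification tags, suits scenarios such as cultural tourism and commercial trade, and is intended to support multimodal research and application development.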
The above content was collected and summarized by 遇见数据集.



