WanJuanSiLu-Multimodal-5Languages

OpenCSG · Updated 2025-04-23 · Indexed 2025-04-26

Download link:

https://www.opencsg.com/datasets/AIWizards/WanJuanSiLu-Multimodal-5Languages

Description:

The "WanJuan·Silk Road Multimodal Multilingual Corpus" is a large-scale, multimodal, multilingual dataset designed to support global multilingual applications and multimodal research. The corpus contains four types of data: image-text, audio-text, video-text, and instruction fine-tuning (SFT) data. It covers eight languages: Arabic, Russian, Korean, Vietnamese, Thai, Serbian, Hungarian, and Czech. The corpus totals more than 11.5 million entries, with a cumulative audio and video duration of over 26,000 hours. Its content spans domains such as culture and tourism, business and trade, science and education, society and the humanities, and entertainment and media, with a special focus on cultural adversarial examples for detecting cultural bias in models. All data has been finely annotated and quality-checked by both machines and local human experts to an industrial-grade standard. Annotations include multi-dimensional classification labels, detailed text descriptions, and integrated multimodal annotations. The data was collected from diverse sources, including Wikipedia, mainstream news media, and streaming video platforms, and audio quality is ensured through dual ASR verification and environmental noise reduction. The dataset is suited to tasks such as dialogue generation, object detection, and low-resource language processing. It is licensed under CC BY 4.0, which permits free sharing and adaptation provided the source is attributed.

Created:

2025-04-24

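Background overview

The "WanJuan·Silk Road Multimodal Multilingual Corpus" is a large-scale, multimodal, multilingual dataset comprising four types of data and eight languages, with more than 11.5 million entries and over 26,000 hours of audio and video. The dataset has been finely annotated and quality-checked, is suited to a range of research tasks, and is licensed under CC BY 4.0.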

The summary above was collected and generated by 遇见数据集.
