visper

Name: visper
Creator: maas
Published: 2026-04-28 16:50:13
License: No description available

ModelScope community. Updated: 2026-04-28. Indexed: 2025-12-06.

Download link:

https://modelscope.cn/datasets/tiiuae/visper


Resource description:

# ViSpeR: Multilingual Audio-Visual Speech Recognition

This repository contains **ViSpeR**, a large-scale dataset and models for Visual Speech Recognition (VSR) in Arabic, Chinese, French and Spanish.

## Dataset Summary:

Given the scarcity of publicly available VSR data for non-English languages, we collected VSR data at scale for four of the most widely spoken languages.

Comparison of VSR datasets. Our proposed ViSpeR dataset is larger than other datasets that cover non-English languages for the VSR task. For our dataset, the numbers in parentheses denote the number of clips. We also give the clip coverage of the TedX and Wild subsets of our ViSpeR dataset.

| Dataset | French (fr) | Spanish (es) | Arabic (ar) | Chinese (zh) |
|-----------------|-----------------|-----------------|-----------------|-----------------|
| **MuAVIC** | 176 | 178 | 16 | -- |
| **VoxCeleb2** | 124 | 42 | -- | -- |
| **AVSpeech** | 122 | 270 | -- | -- |
| **ViSpeR (TedX)** | 192 (160k) | 207 (151k) | 49 (48k) | 129 (143k) |
| **ViSpeR (Wild)** | 680 (481k) | 587 (383k) | 1152 (1.01M) | 658 (593k) |
| **ViSpeR (full)** | 872 (641k) | 794 (534k) | 1200 (1.06M) | 787 (736k) |

## Downloading the data:

First, use the provided video lists to download the videos and put them in separate folders. The raw data should be structured as follows:

```bash
Data/
├── Chinese/
│   ├── video_id.mp4
│   └── ...
├── Arabic/
│   ├── video_id.mp4
│   └── ...
├── French/
│   ├── video_id.mp4
│   └── ...
├── Spanish/
│   ├── video_id.mp4
│   └── ...
```

## Processing the data:

Please refer to our [ViSpeR GitHub repository](https://github.com/YasserdahouML/visper) for further details.

## Intended Use

This dataset can be used to train models for visual speech recognition. It is particularly useful for research and development in the field of audio-visual content processing. The data can also be used to assess the performance of current and future models.
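The folder layout described under "Downloading the data" can be sanity-checked before any processing. The snippet below is only a minimal sketch: the helper name `summarize_layout` is hypothetical and not part of the ViSpeR code base, and it assumes the four language folders are named exactly as in the directory tree and that videos use the `.mp4` extension.

```python
from pathlib import Path

# Folder names copied from the directory tree in the README (assumed exact).
LANGUAGES = ("Chinese", "Arabic", "French", "Spanish")


def summarize_layout(root):
    """Map each language to its number of .mp4 files, or None if its folder is missing.

    Hypothetical helper for illustration; not part of the ViSpeR repository.
    """
    root = Path(root)
    summary = {}
    for language in LANGUAGES:
        folder = root / language
        if not folder.is_dir():
            summary[language] = None  # folder has not been created yet
            continue
        # Count only regular files with an .mp4 extension (case-insensitive).
        summary[language] = sum(
            1 for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() == ".mp4"
        )
    return summary
```

Calling `summarize_layout("Data")` returns a dictionary keyed by language, which makes it easy to spot a missing folder (`None`) or an empty one (`0`) before running the processing scripts.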
## Limitations and Biases

Due to the data collection process focusing on YouTube, biases inherent to the platform may be present in the dataset. Also, while measures are taken to ensure diversity in content, the dataset might still be skewed towards certain types of content due to the filtering process.

## ViSpeR paper coming soon

## Check our VSR related works

```bibtex
@inproceedings{djilali2023lip2vec,
  title={Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping},
  author={Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and Boussaid, Haithem and Almazrouei, Ebtessam and Debbah, Merouane},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={13790--13801},
  year={2023}
}

@inproceedings{djilali2024vsr,
  title={Do VSR Models Generalize Beyond LRS3?},
  author={Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and LeBihan, Eustache and Boussaid, Haithem and Almazrouei, Ebtesam and Debbah, Merouane},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={6635--6644},
  year={2024}
}
```


Provider:

maas

Created:

2025-10-03

Collected summary

Dataset introduction