visper

ModelScope community · Updated 2026-04-28 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/tiiuae/visper
Resource description:
# ViSpeR: Multilingual Audio-Visual Speech Recognition

This repository contains **ViSpeR**, a large-scale dataset and models for Visual Speech Recognition (VSR) in Arabic, Chinese, French, and Spanish.

## Dataset Summary

Given the scarcity of publicly available VSR data for non-English languages, we collected VSR data at scale for four of the most widely spoken languages.

Comparison of VSR datasets. Our proposed ViSpeR dataset is larger than other datasets that cover non-English languages for the VSR task. Durations are in hours; for our dataset, the numbers in parentheses denote the number of clips. We also give the clip coverage of the TedX and Wild subsets of ViSpeR.

| Dataset | French (fr) | Spanish (es) | Arabic (ar) | Chinese (zh) |
|-----------------|-----------------|-----------------|-----------------|-----------------|
| **MuAVIC** | 176 | 178 | 16 | -- |
| **VoxCeleb2** | 124 | 42 | -- | -- |
| **AVSpeech** | 122 | 270 | -- | -- |
| **ViSpeR (TedX)** | 192 (160k) | 207 (151k) | 49 (48k) | 129 (143k) |
| **ViSpeR (Wild)** | 680 (481k) | 587 (383k) | 1152 (1.01M) | 658 (593k) |
| **ViSpeR (full)** | 872 (641k) | 794 (534k) | 1200 (1.06M) | 787 (736k) |

## Downloading the data

First, use the provided video lists to download the videos and put them in separate folders, one per language. The raw data should be structured as follows:

```text
Data/
├── Chinese/
│   ├── video_id.mp4
│   └── ...
├── Arabic/
│   ├── video_id.mp4
│   └── ...
├── French/
│   ├── video_id.mp4
│   └── ...
├── Spanish/
│   ├── video_id.mp4
│   └── ...
```

## Processing the data

Please refer to our [ViSpeR GitHub repository](https://github.com/YasserdahouML/visper) for further details.

## Intended Use

This dataset can be used to train models for visual speech recognition. It is particularly useful for research and development in audio-visual content processing. The data can also be used to assess the performance of current and future models.
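As a minimal sketch of the raw-data layout described under "Downloading the data", the snippet below indexes the `.mp4` files in each language folder and counts them. The function names `index_raw_videos` and `summarize` are hypothetical helpers introduced here for illustration; they are not part of the ViSpeR tooling, and the root folder name `Data` is taken from the layout shown in this card.

```python
from pathlib import Path

# The four language folders named in the raw-data layout of this card.
LANGUAGES = ("Chinese", "Arabic", "French", "Spanish")


def index_raw_videos(root):
    """Map each language folder under `root` to a sorted list of its .mp4 files.

    Hypothetical helper: assumes the Data/<Language>/<video_id>.mp4 layout.
    """
    root = Path(root)
    index = {}
    for language in LANGUAGES:
        folder = root / language
        # A missing folder yields an empty list instead of raising an error.
        index[language] = sorted(folder.glob("*.mp4")) if folder.is_dir() else []
    return index


def summarize(index):
    """Return the number of videos found per language."""
    return {language: len(paths) for language, paths in index.items()}


if __name__ == "__main__":
    counts = summarize(index_raw_videos("Data"))
    for language, n in counts.items():
        print(f"{language}: {n} video(s)")
```

Running a check like this after the download step makes it easy to spot an empty or misnamed language folder before moving on to processing.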
## Limitations and Biases

Because data collection focused on YouTube, biases inherent to the platform may be present in the dataset. Also, although measures were taken to ensure diversity of content, the dataset may still be skewed toward certain types of content because of the filtering process.

## ViSpeR paper coming soon

## Check our VSR related works

```bibtex
@inproceedings{djilali2023lip2vec,
  title={Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping},
  author={Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and Boussaid, Haithem and Almazrouei, Ebtesam and Debbah, Merouane},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={13790--13801},
  year={2023}
}

@inproceedings{djilali2024vsr,
  title={Do VSR Models Generalize Beyond LRS3?},
  author={Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and LeBihan, Eustache and Boussaid, Haithem and Almazrouei, Ebtesam and Debbah, Merouane},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={6635--6644},
  year={2024}
}
```

Provider:
maas
Created:
2025-10-03
Collected summary
Dataset introduction
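Background and challenges
Background overview
ViSpeR is a large-scale multilingual audio-visual speech recognition dataset built for four non-English languages: Arabic, Chinese, French, and Spanish. According to its comparison table, it is larger than the existing datasets that cover these languages. It comprises two subsets, TedX and Wild, totaling roughly 3 million video clips, and is intended mainly for training and studying visual speech recognition models. Because the data come from YouTube, biases inherent to that platform may be present.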
The content above was collected and summarized by 遇见数据集.
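The overall clip count can be cross-checked from the per-subset figures in the dataset card's comparison table. The sketch below copies the TedX and Wild clip counts (in thousands) and sums them; the dictionary and function names are hypothetical, and the Arabic Wild value of 1010 is an approximation of the rounded "1.01M" in the table.

```python
# Clip counts in thousands, copied from the TedX and Wild rows of the table.
# The Arabic Wild entry approximates the rounded "1.01M" figure.
CLIPS_K = {
    "fr": {"tedx": 160, "wild": 481},
    "es": {"tedx": 151, "wild": 383},
    "ar": {"tedx": 48, "wild": 1010},
    "zh": {"tedx": 143, "wild": 593},
}


def total_clips_k(table):
    """Sum the subset clip counts (in thousands) for each language."""
    return {lang: sum(subsets.values()) for lang, subsets in table.items()}


per_language = total_clips_k(CLIPS_K)
overall = sum(per_language.values())

for lang, clips in per_language.items():
    print(f"{lang}: {clips}k clips")
print(f"all languages: {overall}k clips")
```

The per-language sums agree with the "ViSpeR (full)" row up to rounding, and the overall total comes to just under 3 million clips.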