tiiuae/visper
Source: Hugging Face. Updated 2025-04-17; indexed 2024-06-12.
Download link: https://hf-mirror.com/datasets/tiiuae/visper
Resource description:
---
license: cc-by-nc-2.0
language:
- ar
- fr
- es
- zh
pretty_name: visper
---
# ViSpeR: Multilingual Audio-Visual Speech Recognition
This repository contains **ViSpeR**, a large-scale dataset and models for Visual Speech Recognition (VSR) in Arabic, Chinese, French, and Spanish.
## Dataset Summary:
Given the scarcity of publicly available VSR data for non-English languages, we collected VSR data at scale for four of the most widely spoken languages.
Comparison of VSR datasets. ViSpeR is larger than the other datasets that cover non-English languages for the VSR task. For our dataset, the numbers in parentheses denote the number of clips. We also report the coverage of the TedX and Wild subsets of ViSpeR.
| Dataset | French (fr) | Spanish (es) | Arabic (ar) | Chinese (zh) |
|-----------------|-----------------|-----------------|-----------------|-----------------|
| **MuAVIC** | 176 | 178 | 16 | -- |
| **VoxCeleb2** | 124 | 42 | -- | -- |
| **AVSpeech** | 122 | 270 | -- | -- |
| **ViSpeR (TedX)** | 192 (160k) | 207 (151k) | 49 (48k) | 129 (143k) |
| **ViSpeR (Wild)** | 680 (481k) | 587 (383k) | 1152 (1.01M) | 658 (593k) |
| **ViSpeR (full)** | 872 (641k) | 794 (534k) | 1200 (1.06M) | 787 (736k) |
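The **ViSpeR (full)** row appears to be the sum of the TedX and Wild rows. The short sketch below checks that reading against the figures in the table. The dictionary names and the helper functions are illustrative, and clip counts are written in thousands (so `1.01M` becomes `1010`). Any non-zero difference is presumably a rounding effect in the reported figures; that is an assumption, not something the dataset card states.

```python
# Figures copied from the comparison table above.
# Each tuple is (leading figure, number of clips in thousands).
TEDX = {"fr": (192, 160), "es": (207, 151), "ar": (49, 48), "zh": (129, 143)}
WILD = {"fr": (680, 481), "es": (587, 383), "ar": (1152, 1010), "zh": (658, 593)}
FULL = {"fr": (872, 641), "es": (794, 534), "ar": (1200, 1060), "zh": (787, 736)}


def subset_sum(lang):
    """Add the TedX and Wild figures for one language."""
    return tuple(t + w for t, w in zip(TEDX[lang], WILD[lang]))


def differences():
    """Return (TedX + Wild) minus the reported full figures, per language."""
    return {
        lang: tuple(s - f for s, f in zip(subset_sum(lang), FULL[lang]))
        for lang in FULL
    }


for lang, diff in differences().items():
    print(lang, subset_sum(lang), "reported:", FULL[lang], "difference:", diff)
```

French, Spanish, and Chinese match exactly; only the Arabic figures differ slightly from the sum of the two subsets.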
## Downloading the data:
First, use the provided video lists to download the videos and put them in separate folders, one per language. The raw data should be structured as follows:
```text
Data/
├── Chinese/
│   ├── video_id.mp4
│   └── ...
├── Arabic/
│   ├── video_id.mp4
│   └── ...
├── French/
│   ├── video_id.mp4
│   └── ...
└── Spanish/
    ├── video_id.mp4
    └── ...
```
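Before moving on to processing, it can help to confirm that the downloaded files match this layout. The following is a minimal sketch, not part of the ViSpeR tooling: the function name `count_videos` and the `LANGUAGES` tuple are hypothetical, and it assumes the root folder is named `Data` with one subfolder per language containing `.mp4` files, as in the tree above.

```python
from pathlib import Path

# Folder names taken from the directory tree above.
LANGUAGES = ("Chinese", "Arabic", "French", "Spanish")


def count_videos(root):
    """Return {language: number of .mp4 files} for the expected layout.

    Raises FileNotFoundError if a language folder is missing.
    """
    root = Path(root)
    counts = {}
    for language in LANGUAGES:
        folder = root / language
        if not folder.is_dir():
            raise FileNotFoundError(f"missing language folder: {folder}")
        # Only count regular files directly inside the language folder.
        counts[language] = sum(1 for p in folder.glob("*.mp4") if p.is_file())
    return counts
```

Calling `count_videos("Data")` from the directory that contains `Data/` returns a per-language count that can be compared against the length of each video list.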
## Processing the data:
Please refer to our [ViSpeR GitHub repository](https://github.com/YasserdahouML/visper) for further details.
## Intended Use
This dataset can be used to train models for visual speech recognition. It's particularly useful for research and development purposes in the field of audio-visual content processing. The data can be used to assess the performance of current and future models.
## Limitations and Biases
Due to the data collection process focusing on YouTube, biases inherent to the platform may be present in the dataset. Also, while measures are taken to ensure diversity in content, the dataset might still be skewed towards certain types of content due to the filtering process.
## ViSpeR paper coming soon
## Check our VSR-related works
```bibtex
@inproceedings{djilali2023lip2vec,
title={Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping},
author={Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and Boussaid, Haithem and Almazrouei, Ebtessam and Debbah, Merouane},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={13790--13801},
year={2023}
}
@inproceedings{djilali2024vsr,
title={Do VSR Models Generalize Beyond LRS3?},
author={Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and LeBihan, Eustache and Boussaid, Haithem and Almazrouei, Ebtesam and Debbah, Merouane},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
pages={6635--6644},
year={2024}
}
```
Provided by:
tiiuae