Chinese-LiPS
Source: ModelScope community. Updated 2026-05-13; indexed 2025-04-26.
Download link: https://modelscope.cn/datasets/BAAI/Chinese-LiPS
# Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides
[Hugging Face](https://huggingface.co/datasets/BAAI/Chinese-LiPS) | [License: CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) | [Project page](https://kiri0824.github.io/Chinese-LiPS/) | [arXiv:2504.15066](https://arxiv.org/abs/2504.15066)
## ⭐ Introduction
The **Chinese-LiPS** dataset is a multimodal dataset designed for audio-visual speech recognition (AVSR) in Mandarin Chinese. This dataset combines speech, video, and textual transcriptions to enhance automatic speech recognition (ASR) performance, especially in educational and instructional scenarios.
## 🚀 Dataset Details
- **Total Duration:** 100.84 hours
- **Number of Speakers:** 207 professional speakers
- **Number of Clips:** 36,208 video clips
- **Audio Format:** Stereo WAV, 48 kHz sampling rate
- **Video Format:**
  - **Slide Video:** 1080p resolution, 30 fps
  - **Lip-Reading Video:** 720p resolution, 30 fps
- **Annotations:** JSON format with transcriptions and extracted text from slides
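
As a quick sanity check on the raw audio format listed above, the following sketch reads a WAV header with Python's standard `wave` module and compares it with the stated stereo, 48 kHz layout. It assumes the files are plain PCM WAV (the `wave` module cannot open other encodings); the names `WavInfo`, `read_wav_info`, and `matches_raw_format` are illustrative and not part of the dataset's tooling.

```python
import wave
from dataclasses import dataclass


@dataclass
class WavInfo:
    channels: int
    sample_rate: int
    duration_s: float


def read_wav_info(path: str) -> WavInfo:
    """Read channel count, sample rate, and duration from a PCM WAV header."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return WavInfo(wf.getnchannels(), rate, frames / rate)


def matches_raw_format(info: WavInfo) -> bool:
    """True if the file is stereo at 48 kHz, as listed for the raw audio."""
    return info.channels == 2 and info.sample_rate == 48_000
```

Running `matches_raw_format(read_wav_info(path))` over a few extracted files is a cheap way to confirm that a download is the raw release rather than the 16 kHz processed one.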
### Dataset Statistics
| Split | Duration (hrs) | # Segments | # Speakers |
| ---------- | -------------- | ---------- | ---------- |
| Train | 85.37 | 30,341 | 175 |
| Validation | 5.35 | 1,959 | 11 |
| Test | 10.12 | 3,908 | 21 |
| **Total** | **100.84** | **36,208** | **207** |
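
The totals in the table can be checked with a few lines of arithmetic. The sketch below copies the per-split figures from the table, sums them, and derives the mean segment length; the helper names are illustrative. Summing the speaker counts assumes that no speaker appears in more than one split, which is consistent with the stated total of 207.

```python
# (duration in hours, number of segments, number of speakers), from the table
SPLITS = {
    "train": (85.37, 30_341, 175),
    "validation": (5.35, 1_959, 11),
    "test": (10.12, 3_908, 21),
}


def totals(splits):
    """Sum hours, segments, and speakers across all splits."""
    hours = round(sum(h for h, _, _ in splits.values()), 2)
    segments = sum(s for _, s, _ in splits.values())
    speakers = sum(p for _, _, p in splits.values())
    return hours, segments, speakers


def mean_segment_seconds(hours, segments):
    """Average segment length in seconds."""
    return hours * 3600 / segments


if __name__ == "__main__":
    hours, segments, speakers = totals(SPLITS)
    print(hours, segments, speakers)
    print(round(mean_segment_seconds(hours, segments), 2))
```

The sums reproduce the **Total** row, and the mean segment length works out to roughly 10 seconds.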
## 📂 Dataset Organization
The dataset is structured into several compressed files:
- **image.zip**: First-frame images from slide videos (used for OCR and vision-language models).
- **processed_test.zip, processed_val.zip, processed_train.zip**: Processed data with 16 kHz audio, 96×96 25-frame lip-reading videos, and JSON annotations.
- **train.zip, test.zip, val.zip**: Raw data for the training, test, and validation sets. Each archive contains:
```
├── ID1_age_gender_topic/
│ ├── WAV/
│ │ ├── ID1_age_gender_topic_001.json # Annotation file
│ │ ├── ID1_age_gender_topic_001.wav # Audio file (48 kHz)
│ ├── PPT/
│ │ ├── ID1_age_gender_topic_001_PPT.mp4 # Slide video (1080p 30fps)
│ ├── FACE/
│ │ ├── ID1_age_gender_topic_001_FACE.mp4 # Lip-reading video (720p 30fps)
├── ...
```
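
The layout above pairs each audio file with an annotation, a slide video, and a lip-reading video that share the same stem. A minimal sketch of collecting those groups with `pathlib` follows. It assumes `split_root` is the directory an archive extracts to, with the session folders directly beneath it; the `Sample` type and `iter_samples` function are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class Sample(NamedTuple):
    utt_id: str
    wav: Path
    annotation: Path
    slide_video: Path
    face_video: Path


def iter_samples(split_root: Path) -> Iterator[Sample]:
    """Yield one Sample per .wav file whose companion files all exist."""
    for wav in sorted(split_root.glob("*/WAV/*.wav")):
        session = wav.parent.parent  # e.g. ID1_age_gender_topic/
        utt_id = wav.stem            # e.g. ID1_age_gender_topic_001
        sample = Sample(
            utt_id=utt_id,
            wav=wav,
            annotation=wav.with_suffix(".json"),
            slide_video=session / "PPT" / f"{utt_id}_PPT.mp4",
            face_video=session / "FACE" / f"{utt_id}_FACE.mp4",
        )
        # Skip utterances with a missing annotation or video.
        if all(p.is_file() for p in sample[1:]):
            yield sample
```

Silently skipping incomplete groups is a design choice that suits quick exploration; a stricter loader might raise an error instead so that a partial download is noticed early.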
- **meta_all.csv, meta_train.csv, meta_valid.csv, meta_test.csv**: Metadata files with ID, TOPIC, WAV, PPT, FACE, and TEXT fields.
  TOPIC values are abbreviations of the Chinese topic names: DZJJ = E-sports & Gaming, JKYS = Health & Wellness, KJ = Science & Technology, LY = Travel & Exploration, QC = Automobile & Industry, RWLS = Culture & History, TY = Sports & Competitions, YS = Movies & TV Series, ZX = Others.
- **meta_test.json**: OCR and InternVL2 prompts for the test set, with the following fields:
```
wav_path: Path to the audio file.
ppt_path: Path to the first-frame image of the slide video.
ocr_text: Text extracted by PaddleOCR.
vl2_text: Text extracted by InternVL2.
gt_text: Ground truth transcription of the audio.
ocr_vl2_text: OCR text reprocessed by InternVL2 (not a concatenation of PaddleOCR and InternVL2 results).
```
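
The metadata files described above can be read with the standard library alone. The sketch below makes several assumptions that the card does not confirm: the CSV files have a header row using the listed column names, `meta_test.json` is a JSON array of objects carrying the six listed keys, and both are UTF-8 encoded. The function names and the added `TOPIC_NAME` column are illustrative.

```python
import csv
import json
from typing import Dict, List

# TOPIC abbreviations as listed in the metadata description
TOPIC_NAMES = {
    "DZJJ": "E-sports & Gaming",
    "JKYS": "Health & Wellness",
    "KJ": "Science & Technology",
    "LY": "Travel & Exploration",
    "QC": "Automobile & Industry",
    "RWLS": "Culture & History",
    "TY": "Sports & Competitions",
    "YS": "Movies & TV Series",
    "ZX": "Others",
}

META_TEST_KEYS = (
    "wav_path", "ppt_path", "ocr_text", "vl2_text", "gt_text", "ocr_vl2_text",
)


def load_meta_csv(path: str) -> List[Dict[str, str]]:
    """Read a meta_*.csv file and add a readable TOPIC_NAME to each row.

    Assumption: a header row with ID, TOPIC, WAV, PPT, FACE, TEXT columns.
    """
    with open(path, newline="", encoding="utf-8-sig") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row["TOPIC_NAME"] = TOPIC_NAMES.get(row["TOPIC"], "unknown")
    return rows


def load_meta_test(path: str) -> List[Dict[str, str]]:
    """Read meta_test.json and check each record for the expected keys.

    Assumption: the file holds a JSON array of objects.
    """
    with open(path, encoding="utf-8-sig") as f:
        records = json.load(f)
    for i, rec in enumerate(records):
        missing = [k for k in META_TEST_KEYS if k not in rec]
        if missing:
            raise KeyError(f"record {i} is missing {missing}")
    return records
```

If the actual files are laid out differently (for example, a JSON object keyed by utterance ID), only the two loader functions need to change; the topic table stays the same.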
## 📥 Download
You can download the dataset from the following sources:
- [Download from OneDrive](https://1drv.ms/f/c/721006f535f6400c/EgxA9jX1BhAggHI-hgAAAAABgpJYJF-leYBGBdmjBuBQxw)
- [Download from Hugging Face](https://huggingface.co/datasets/BAAI/Chinese-LiPS)
- [Download from Baidu Netdisk](https://pan.baidu.com/s/11nvn79-3Inf3QDyJomlLAA?pwd=vg2a) (Password: **vg2a**)
## 📚 Citation
```bibtex
@misc{zhao2025chineselipschineseaudiovisualspeech,
title={Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides},
author={Jinghua Zhao and Yuhang Jia and Shiyao Wang and Jiaming Zhou and Hui Wang and Yong Qin},
year={2025},
eprint={2504.15066},
archivePrefix={arXiv},
primaryClass={cs.MM},
url={https://arxiv.org/abs/2504.15066}
}
```
- **Provider:** maas
- **Created:** 2025-04-23
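## Background Overview
Chinese-LiPS is a 100.84-hour multimodal dataset for Mandarin audio-visual speech recognition. It comprises 36,208 video clips from 207 speakers, with speech, video, and text transcriptions, and targets speech recognition research in educational and instructional scenarios. The dataset is clearly organized, is available through several download channels, and comes with citation information.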