
Chinese-LiPS

ModelScope community · Updated 2026-05-13 · Indexed 2025-04-26
Download link:
https://modelscope.cn/datasets/BAAI/Chinese-LiPS
Resource overview:
# Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides

[![Hugging Face Datasets](https://img.shields.io/badge/🤗%20Hugging%20Face-Datasets-yellow.svg)](https://huggingface.co/datasets/BAAI/Chinese-LiPS) [![License: CC BY-NC-SA-4.0](https://img.shields.io/badge/License-CC%20BY--SA--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/) [![GitHub Pages](https://img.shields.io/badge/GitHub-Pages-blue.svg)](https://kiri0824.github.io/Chinese-LiPS/) [![arXiv](https://img.shields.io/badge/arXiv-2504.15066-b31b1b.svg)](https://arxiv.org/abs/2504.15066)

## ⭐ Introduction

The **Chinese-LiPS** dataset is a multimodal dataset designed for audio-visual speech recognition (AVSR) in Mandarin Chinese. It combines speech, video, and textual transcriptions to improve automatic speech recognition (ASR) performance, especially in educational and instructional scenarios.

## 🚀 Dataset Details

- **Total Duration:** 100.84 hours
- **Number of Speakers:** 207 professional speakers
- **Number of Clips:** 36,208 video clips
- **Audio Format:** Stereo WAV, 48 kHz sampling rate
- **Video Format:**
  - **Slide Video:** 1080p resolution, 30 fps
  - **Lip-Reading Video:** 720p resolution, 30 fps
- **Annotations:** JSON format with transcriptions and extracted text from slides

### Dataset Statistics

| Split      | Duration (hrs) | # Segments | # Speakers |
| ---------- | -------------- | ---------- | ---------- |
| Train      | 85.37          | 30,341     | 175        |
| Validation | 5.35           | 1,959      | 11         |
| Test       | 10.12          | 3,908      | 21         |
| **Total**  | **100.84**     | **36,208** | **207**    |

## 📂 Dataset Organization

The dataset is distributed as several compressed files:

- **image.zip**: First-frame images from slide videos (used for OCR and vision-language models).
- **processed_test.zip, processed_val.zip, processed_train.zip**: Processed data with 16 kHz audio, 96×96 25-frame lip-reading videos, and JSON annotations.
- **train.zip, test.zip, val.zip**: Data split into training, testing, and validation sets. Each contains:

```
├── ID1_age_gender_topic/
│   ├── WAV/
│   │   ├── ID1_age_gender_topic_001.json       # Annotation file
│   │   ├── ID1_age_gender_topic_001.wav        # Audio file (48 kHz)
│   ├── PPT/
│   │   ├── ID1_age_gender_topic_001_PPT.mp4    # Slide video (1080p 30fps)
│   ├── FACE/
│   │   ├── ID1_age_gender_topic_001_FACE.mp4   # Lip-reading video (720p 30fps)
├── ...
```

- **meta_all.csv, meta_train.csv, meta_valid.csv, meta_test.csv**: Metadata files with ID, TOPIC, WAV, PPT, FACE, and TEXT fields. The TOPIC field uses abbreviations of the Chinese topic names:

| Code | Topic                 |
| ---- | --------------------- |
| DZJJ | E-sports & Gaming     |
| JKYS | Health & Wellness     |
| KJ   | Science & Technology  |
| LY   | Travel & Exploration  |
| QC   | Automobile & Industry |
| RWLS | Culture & History     |
| TY   | Sports & Competitions |
| YS   | Movies & TV Series    |
| ZX   | Others                |

- **meta_test.json**: Includes OCR and InternVL2 prompts for the test set. Each entry has the following fields:

```
wav_path: Path to the audio file.
ppt_path: Path to the first-frame image of the slide video.
ocr_text: Text extracted by PaddleOCR.
vl2_text: Text extracted by InternVL2.
gt_text: Ground truth transcription of the audio.
ocr_vl2_text: OCR text reprocessed by InternVL2 (not a concatenation of PaddleOCR and InternVL2 results).
```

## 📥 Download

You can download the dataset from the following sources:

- [Download from OneDrive](https://1drv.ms/f/c/721006f535f6400c/EgxA9jX1BhAggHI-hgAAAAABgpJYJF-leYBGBdmjBuBQxw)
- [Download from Hugging Face](https://huggingface.co/datasets/BAAI/Chinese-LiPS)
- [Download from Baidu Netdisk](https://pan.baidu.com/s/11nvn79-3Inf3QDyJomlLAA?pwd=vg2a) (Password: **vg2a**)

## 📚 Citation

```bibtex
@misc{zhao2025chineselipschineseaudiovisualspeech,
  title={Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides},
  author={Jinghua Zhao and Yuhang Jia and Shiyao Wang and Jiaming Zhou and Hui Wang and Yong Qin},
  year={2025},
  eprint={2504.15066},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2504.15066}
}
```
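The directory layout and TOPIC codes described under Dataset Organization can be turned into a small path helper. The following is a minimal sketch, not part of the official tooling: the function names `clip_paths` and `topic_of` are hypothetical, the directory name `ID1_30_F_KJ` is an invented instance of the `ID1_age_gender_topic` pattern, and it assumes that clip numbers are zero-padded to three digits (as in `_001`) and that the topic code is the last underscore-separated token.

```python
from pathlib import PurePosixPath

# Topic abbreviations as listed for the TOPIC metadata field.
TOPIC_NAMES = {
    "DZJJ": "E-sports & Gaming",
    "JKYS": "Health & Wellness",
    "KJ": "Science & Technology",
    "LY": "Travel & Exploration",
    "QC": "Automobile & Industry",
    "RWLS": "Culture & History",
    "TY": "Sports & Competitions",
    "YS": "Movies & TV Series",
    "ZX": "Others",
}


def clip_paths(speaker_dir: str, clip_index: int) -> dict:
    """Build the relative file paths of one clip, following the documented tree.

    Assumption: clip numbers are zero-padded to three digits, as in `_001`.
    """
    stem = f"{speaker_dir}_{clip_index:03d}"
    root = PurePosixPath(speaker_dir)
    return {
        "json": str(root / "WAV" / f"{stem}.json"),      # annotation file
        "wav": str(root / "WAV" / f"{stem}.wav"),        # 48 kHz audio
        "ppt": str(root / "PPT" / f"{stem}_PPT.mp4"),    # slide video
        "face": str(root / "FACE" / f"{stem}_FACE.mp4"), # lip-reading video
    }


def topic_of(speaker_dir: str) -> str:
    """Return the readable topic name.

    Assumption: the topic code is the last `_`-separated token of the name.
    """
    code = speaker_dir.rsplit("_", 1)[-1]
    return TOPIC_NAMES.get(code, "unknown")


if __name__ == "__main__":
    # "ID1_30_F_KJ" is a made-up directory name used only for illustration.
    for kind, path in clip_paths("ID1_30_F_KJ", 1).items():
        print(kind, path)
    print(topic_of("ID1_30_F_KJ"))
```

Building paths from the directory name keeps the audio, slide video, and lip-reading video of a clip aligned without scanning the archive. If the real directory names differ from the assumed pattern, the WAV, PPT, and FACE columns of the metadata CSV files are the safer source of paths.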

Provider:
maas
Created:
2025-04-23
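Background overview
Chinese-LiPS is a 100.84-hour multimodal dataset designed for Mandarin audio-visual speech recognition. It contains speech, video, and text transcriptions for 36,208 video clips from 207 speakers, and it is suited to speech recognition research in educational and instructional scenarios. The dataset is clearly structured, offers several download options, and includes detailed citation information.
The summary above was collected and generated by 遇见数据集.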