Chinese-LiPS

Name: Chinese-LiPS
Creator: maas
Published: 2026-05-13 01:40:37
License: No description provided

Source: ModelScope community; updated 2026-05-13; indexed 2025-04-26

Download link:

https://modelscope.cn/datasets/BAAI/Chinese-LiPS


Resource description:

# Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides

[![Hugging Face Datasets](https://img.shields.io/badge/🤗%20Hugging%20Face-Datasets-yellow.svg)](https://huggingface.co/datasets/BAAI/Chinese-LiPS)
[![License: CC BY-NC-SA-4.0](https://img.shields.io/badge/License-CC%20BY--SA--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)
[![GitHub Pages](https://img.shields.io/badge/GitHub-Pages-blue.svg)](https://kiri0824.github.io/Chinese-LiPS/)
[![arXiv](https://img.shields.io/badge/arXiv-1706.03762-b31b1b.svg)](https://arxiv.org/abs/2504.15066)

## ⭐ Introduction

The **Chinese-LiPS** dataset is a multimodal dataset for audio-visual speech recognition (AVSR) in Mandarin Chinese. It combines speech, video, and textual transcriptions to improve automatic speech recognition (ASR) performance, especially in educational and instructional scenarios.

## 🚀 Dataset Details

- **Total Duration:** 100.84 hours
- **Number of Speakers:** 207 professional speakers
- **Number of Clips:** 36,208 video clips
- **Audio Format:** Stereo WAV, 48 kHz sampling rate
- **Video Format:**
  - **Slide Video:** 1080p resolution, 30 fps
  - **Lip-Reading Video:** 720p resolution, 30 fps
- **Annotations:** JSON format with transcriptions and text extracted from slides

### Dataset Statistics

| Split      | Duration (hrs) | # Segments | # Speakers |
| ---------- | -------------- | ---------- | ---------- |
| Train      | 85.37          | 30,341     | 175        |
| Validation | 5.35           | 1,959      | 11         |
| Test       | 10.12          | 3,908      | 21         |
| **Total**  | **100.84**     | **36,208** | **207**    |

## 📂 Dataset Organization

The dataset is distributed as several compressed files:

- **image.zip**: First-frame images from the slide videos (used for OCR and vision-language models).
- **processed_test.zip, processed_val.zip, processed_train.zip**: Processed data with 16 kHz audio, 96×96 25-frame lip-reading videos, and JSON annotations.
- **train.zip, test.zip, val.zip**: Raw data for the training, test, and validation splits. Each archive has the following layout:

```
├── ID1_age_gender_topic/
│   ├── WAV/
│   │   ├── ID1_age_gender_topic_001.json      # Annotation file
│   │   ├── ID1_age_gender_topic_001.wav       # Audio file (48 kHz)
│   ├── PPT/
│   │   ├── ID1_age_gender_topic_001_PPT.mp4   # Slide video (1080p 30fps)
│   ├── FACE/
│   │   ├── ID1_age_gender_topic_001_FACE.mp4  # Lip-reading video (720p 30fps)
├── ...
```

- **meta_all.csv, meta_train.csv, meta_valid.csv, meta_test.csv**: Metadata files with ID, TOPIC, WAV, PPT, FACE, and TEXT fields. The TOPIC field uses abbreviations of the Chinese topic names:

  | Code | Topic                 |
  | ---- | --------------------- |
  | DZJJ | E-sports & Gaming     |
  | JKYS | Health & Wellness     |
  | KJ   | Science & Technology  |
  | LY   | Travel & Exploration  |
  | QC   | Automobile & Industry |
  | RWLS | Culture & History     |
  | TY   | Sports & Competitions |
  | YS   | Movies & TV Series    |
  | ZX   | Others                |

- **meta_test.json**: Includes OCR and InternVL2 prompts for the test set, with the following fields:

```
wav_path: Path to the audio file.
ppt_path: Path to the first-frame image of the slide video.
ocr_text: Text extracted by PaddleOCR.
vl2_text: Text extracted by InternVL2.
gt_text: Ground truth transcription of the audio.
ocr_vl2_text: OCR text reprocessed by InternVL2 (not a concatenation of PaddleOCR and InternVL2 results).
```

## 📥 Download

You can download the dataset from the following sources:

- [Download from OneDrive](https://1drv.ms/f/c/721006f535f6400c/EgxA9jX1BhAggHI-hgAAAAABgpJYJF-leYBGBdmjBuBQxw)
- [Download from Huggingface](https://huggingface.co/datasets/BAAI/Chinese-LiPS)
- [Download from Baidu Netdisk](https://pan.baidu.com/s/11nvn79-3Inf3QDyJomlLAA?pwd=vg2a) (Password: **vg2a**)

## 📚 Citation

```bibtex
@misc{zhao2025chineselipschineseaudiovisualspeech,
  title={Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides},
  author={Jinghua Zhao and Yuhang Jia and Shiyao Wang and Jiaming Zhou and Hui Wang and Yong Qin},
  year={2025},
  eprint={2504.15066},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2504.15066}
}
```
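
The raw split layout described under Dataset Organization suggests that the four files of a clip can be derived from the speaker folder name and a clip index. The following is a minimal sketch under that assumption. The helpers `clip_paths` and `topic_label` and the sample folder name `ID1_30_F_KJ` are hypothetical; they only mirror the `ID1_age_gender_topic` placeholder pattern, and the real folder names may encode these fields differently.

```python
from pathlib import Path

# Topic codes and labels as listed in the dataset description.
TOPIC_LABELS = {
    "DZJJ": "E-sports & Gaming",
    "JKYS": "Health & Wellness",
    "KJ": "Science & Technology",
    "LY": "Travel & Exploration",
    "QC": "Automobile & Industry",
    "RWLS": "Culture & History",
    "TY": "Sports & Competitions",
    "YS": "Movies & TV Series",
    "ZX": "Others",
}


def clip_paths(split_root: str, folder: str, clip_index: str) -> dict:
    """Build the four expected file paths for one clip (hypothetical helper)."""
    stem = f"{folder}_{clip_index}"
    base = Path(split_root) / folder
    return {
        "json": base / "WAV" / f"{stem}.json",       # annotation file
        "wav": base / "WAV" / f"{stem}.wav",         # 48 kHz audio
        "ppt": base / "PPT" / f"{stem}_PPT.mp4",     # slide video
        "face": base / "FACE" / f"{stem}_FACE.mp4",  # lip-reading video
    }


def topic_label(folder: str) -> str:
    """Assumption: the topic code is the last underscore-separated field."""
    code = folder.rsplit("_", 1)[-1]
    return TOPIC_LABELS.get(code, "unknown")


# Hypothetical folder name, used only to exercise the helpers.
paths = clip_paths("train", "ID1_30_F_KJ", "001")
print(paths["face"].as_posix())
print(topic_label("ID1_30_F_KJ"))
```

Before relying on these paths, it is worth comparing a few of them against an extracted archive, since the exact naming of the folders is not spelled out in the description.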

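
The metadata CSV files are described as having ID, TOPIC, WAV, PPT, FACE, and TEXT fields, so a per-topic clip count is a quick sanity check after download. The sketch below assumes that each file is comma-delimited and starts with a header row using exactly those column names, which the description does not confirm. The function `count_topics` and the sample rows are hypothetical stand-ins, not real dataset entries.

```python
import csv
from collections import Counter
from io import StringIO


def count_topics(csv_file) -> Counter:
    """Count rows per TOPIC code in an open metadata CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["TOPIC"] for row in reader)


# Hypothetical rows standing in for a metadata file such as meta_train.csv.
sample = StringIO(
    "ID,TOPIC,WAV,PPT,FACE,TEXT\n"
    "clip_a,KJ,a.wav,a_PPT.mp4,a_FACE.mp4,placeholder text\n"
    "clip_b,KJ,b.wav,b_PPT.mp4,b_FACE.mp4,placeholder text\n"
    "clip_c,LY,c.wav,c_PPT.mp4,c_FACE.mp4,placeholder text\n"
)
counts = count_topics(sample)
print(counts.most_common())
```

For a real file, the same function can be called on `open("meta_train.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is also an assumption.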

Provider:

maas

Created:

2025-04-23
