Chinese-LiPS | Audio-Visual Speech Recognition Dataset | Chinese Language Processing Dataset

Source: ModelScope community · Updated 2025-06-06 · Indexed 2025-04-26

Audio-visual speech recognition

Chinese language processing

Download link:

https://modelscope.cn/datasets/BAAI/Chinese-LiPS


Resource description:

# Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides

[![Hugging Face Datasets](https://img.shields.io/badge/🤗%20Hugging%20Face-Datasets-yellow.svg)](https://huggingface.co/datasets/BAAI/Chinese-LiPS) [![License: CC BY-NC-SA-4.0](https://img.shields.io/badge/License-CC%20BY--SA--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/) [![GitHub Pages](https://img.shields.io/badge/GitHub-Pages-blue.svg)](https://kiri0824.github.io/Chinese-LiPS/) [![arXiv](https://img.shields.io/badge/arXiv-1706.03762-b31b1b.svg)](https://arxiv.org/abs/2504.15066)

## ⭐ Introduction

The **Chinese-LiPS** dataset is a multimodal dataset designed for audio-visual speech recognition (AVSR) in Mandarin Chinese. This dataset combines speech, video, and textual transcriptions to enhance automatic speech recognition (ASR) performance, especially in educational and instructional scenarios.

## 🚀 Dataset Details

- **Total Duration:** 100.84 hours
- **Number of Speakers:** 207 professional speakers
- **Number of Clips:** 36,208 video clips
- **Audio Format:** Stereo WAV, 48 kHz sampling rate
- **Video Format:**
  - **Slide Video:** 1080p resolution, 30 fps
  - **Lip-Reading Video:** 720p resolution, 30 fps
- **Annotations:** JSON format with transcriptions and extracted text from slides

### Dataset Statistics

| Split      | Duration (hrs) | # Segments | # Speakers |
| ---------- | -------------- | ---------- | ---------- |
| Train      | 85.37          | 30,341     | 175        |
| Validation | 5.35           | 1,959      | 11         |
| Test       | 10.12          | 3,908      | 21         |
| **Total**  | **100.84**     | **36,208** | **207**    |

## 📂 Dataset Organization

The dataset is structured into several compressed files:

- **image.zip**: First-frame images from slide videos (used for OCR and vision-language models).
- **processed_test.zip, processed_val.zip, processed_train.zip**: Processed data with 16 kHz audio, 96×96 25-frame lip-reading videos, and JSON annotations.
- **train.zip, test.zip, val.zip**: Data split into training, testing, and validation sets. Each contains:

  ```
  ├── ID1_age_gender_topic/
  │   ├── WAV/
  │   │   ├── ID1_age_gender_topic_001.json      # Annotation file
  │   │   ├── ID1_age_gender_topic_001.wav       # Audio file (48 kHz)
  │   ├── PPT/
  │   │   ├── ID1_age_gender_topic_001_PPT.mp4   # Slide video (1080p 30fps)
  │   ├── FACE/
  │   │   ├── ID1_age_gender_topic_001_FACE.mp4  # Lip-reading video (720p 30fps)
  ├── ...
  ```

- **meta_all.csv, meta_train.csv, meta_valid.csv, meta_test.csv**: Metadata files with ID, TOPIC, WAV, PPT, FACE, and TEXT fields. The TOPIC field is abbreviated in Chinese as follows:
  - DZJJ = E-sports & Gaming
  - JKYS = Health & Wellness
  - KJ = Science & Technology
  - LY = Travel & Exploration
  - QC = Automobile & Industry
  - RWLS = Culture & History
  - TY = Sports & Competitions
  - YS = Movies & TV Series
  - ZX = Others
- **meta_test.json**: Includes OCR and InternVL2 prompts for the test set.

  ```
  wav_path: Path to the audio file.
  ppt_path: Path to the first-frame image of the slide video.
  ocr_text: Text extracted by PaddleOCR.
  vl2_text: Text extracted by InternVL2.
  gt_text: Ground truth transcription of the audio.
  ocr_vl2_text: OCR text reprocessed by InternVL2 (not a concatenation of PaddleOCR and InternVL2 results).
  ```

## 📥 Download

You can download the dataset from the following sources:

- [Download from OneDrive](https://1drv.ms/f/c/721006f535f6400c/EgxA9jX1BhAggHI-hgAAAAABgpJYJF-leYBGBdmjBuBQxw)
- [Download from Huggingface](https://huggingface.co/datasets/BAAI/Chinese-LiPS)
- [Download from Baidu Netdisk](https://pan.baidu.com/s/11nvn79-3Inf3QDyJomlLAA?pwd=vg2a) (Password: **vg2a**)

## 📚 Citation

```bibtex
@misc{zhao2025chineselipschineseaudiovisualspeech,
  title={Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides},
  author={Jinghua Zhao and Yuhang Jia and Shiyao Wang and Jiaming Zhou and Hui Wang and Yong Qin},
  year={2025},
  eprint={2504.15066},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2504.15066}
}
```
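As an illustration of the directory layout described under Dataset Organization, the following is a minimal sketch of how the audio, annotation, slide video, and lip-reading video of each clip might be paired after extracting one of the split archives. The function name `collect_clips` and the returned dictionary keys are hypothetical and not part of the dataset's tooling. The sketch also assumes that the topic code is the last underscore-separated field of each speaker folder name, as the `ID1_age_gender_topic` placeholder suggests; the real folder names may differ.

```python
from pathlib import Path

# Topic abbreviations as listed for the TOPIC field of the metadata files.
TOPIC_NAMES = {
    "DZJJ": "E-sports & Gaming",
    "JKYS": "Health & Wellness",
    "KJ": "Science & Technology",
    "LY": "Travel & Exploration",
    "QC": "Automobile & Industry",
    "RWLS": "Culture & History",
    "TY": "Sports & Competitions",
    "YS": "Movies & TV Series",
    "ZX": "Others",
}


def collect_clips(split_root):
    """Pair each WAV file with its annotation, slide video, and face video.

    Assumes the layout <speaker_dir>/WAV/<clip>.wav and <clip>.json,
    <speaker_dir>/PPT/<clip>_PPT.mp4, <speaker_dir>/FACE/<clip>_FACE.mp4.
    Paths are built from the naming pattern only; no file is opened.
    """
    clips = []
    for wav in sorted(Path(split_root).glob("*/WAV/*.wav")):
        speaker_dir = wav.parent.parent
        stem = wav.stem
        # Assumption: the topic code is the last "_"-separated field
        # of the speaker folder name.
        topic_code = speaker_dir.name.rsplit("_", 1)[-1]
        clips.append({
            "clip_id": stem,
            # Fall back to the raw code if it is not in the table.
            "topic": TOPIC_NAMES.get(topic_code, topic_code),
            "wav": wav,
            "json": wav.with_suffix(".json"),
            "ppt": speaker_dir / "PPT" / f"{stem}_PPT.mp4",
            "face": speaker_dir / "FACE" / f"{stem}_FACE.mp4",
        })
    return clips
```

Calling `collect_clips` on the folder where `train.zip` was extracted would return one dictionary per clip. Because the paths are derived from the naming pattern rather than checked on disk, a caller may want to verify that each `json`, `ppt`, and `face` path exists before use.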

Provider:

maas

Created:

2025-04-23


