MODALITY corpus - SPEAKER 17 - SEQUENCE S5

Name: MODALITY corpus - SPEAKER 17 - SEQUENCE S5
Creator: Gdańsk University of Technology
Published: 2026-03-26 16:28:35
License: No description available

DataCite Commons | Updated: 2026-03-26 | Indexed: 2025-04-16

Download link:

https://mostwiedzy.pl/en/open-research-data/modality-corpus-speaker-17-sequence-s5,622111143643268-0

Description:

The MODALITY corpus is a multimodal database of word recordings in English. It consists of over 30 hours of multimodal recordings. The database contains high-resolution, high-framerate stereoscopic video streams and audio signals obtained from a microphone array and a laptop microphone. Because every utterance was labelled, the corpus can be used to develop an audio-visual speech recognition (AVSR) system. Recordings made in noisy conditions can be used to test the robustness of speech recognition systems.

The language material was based on a remote control scenario and includes 231 words: numbers, names of months and days, and a set of verbs and nouns related to controlling a computer device. Speakers read them both as isolated words and as sequences, resulting in 12 recording sessions per speaker. Half of the sessions were recorded in quiet conditions; the other half contained three kinds of intrusive signals (traffic, babble and factory noise).

The corpus includes recordings of 42 speakers (33 male, 9 female). The participants include 20 students and staff of the Multimedia Systems Department of the Gdańsk University of Technology, 5 students of the Institute of English and American Studies of the University of Gdańsk, and 17 native English speakers.

This dataset consists of recordings and visual features for SPEAKER 17:
sex: woman
native speaker: no
age: 25
The test material: SEQUENCE S5

All recordings for all speakers are available at http://www.modality-corpus.org/

[Image: sample still from the corpus (SPEAKER 17)]

Due to the size of the corpus (approx. 2.5 TB of data), each speaker's recordings were placed in a separate ZIP file of approx. 4-7 GB. The recordings were organized according to the speakers' language skills. Group A (17 speakers) consists of native speakers. Recordings of non-native speakers (Polish nationals) were placed in Group B (25 speakers).

The audio files use the Waveform Audio File Format (.wav) and contain a single PCM audio stream sampled at 44.1 kSa/s with 16-bit depth. The video files use the Matroska Multimedia Container Format (.mkv), which holds a 1080p video stream captured at 100 fps and compressed with the H.264 codec (High 4:4:4 profile). The '.lab' files are text files that give the positions of words in the audio files and follow the HTK label format. Each line of a '.lab' file contains the start and end times (in 100 ns units) followed by the label itself, e.g.:

    1239620000 1244790000 FIVE

which denotes the word "five", occurring between 123.962 s and 124.479 s of the audio. Word-accurate SNR values calculated for every recording are also included in the ZIP file. Unfortunately, visual features are not available due to technical difficulties during the registration process.
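The '.lab' layout described above can be read with a few lines of Python. The following is only a minimal sketch, not code distributed with the corpus: the names `WordLabel`, `parse_lab_line`, `parse_lab` and `to_sample_range` are hypothetical, and it assumes each line holds exactly a start time, an end time and one label, with no extra HTK fields such as scores.

```python
from typing import List, NamedTuple, Tuple

# HTK label times are integers in 100 ns units, i.e. 10,000,000 per second.
HTK_UNITS_PER_SECOND = 10_000_000
# Sampling rate of the .wav files as stated in the description (44.1 kSa/s).
SAMPLE_RATE_HZ = 44_100


class WordLabel(NamedTuple):
    """One labelled word (hypothetical helper type, not part of the corpus)."""
    start_s: float
    end_s: float
    word: str


def parse_lab_line(line: str) -> WordLabel:
    """Parse one '<start> <end> <label>' line into seconds plus the label."""
    start, end, word = line.split(maxsplit=2)
    return WordLabel(
        int(start) / HTK_UNITS_PER_SECOND,
        int(end) / HTK_UNITS_PER_SECOND,
        word.strip(),
    )


def parse_lab(text: str) -> List[WordLabel]:
    """Parse the full text of a '.lab' file, skipping blank lines."""
    return [parse_lab_line(line) for line in text.splitlines() if line.strip()]


def to_sample_range(label: WordLabel,
                    sample_rate: int = SAMPLE_RATE_HZ) -> Tuple[int, int]:
    """Convert a label's time span to (first, last) sample indices, rounded."""
    return round(label.start_s * sample_rate), round(label.end_s * sample_rate)
```

Applied to the example line above, `parse_lab_line("1239620000 1244790000 FIVE")` yields a start of 123.962 s and an end of 124.479 s. `to_sample_range` then maps that span onto sample indices, which can be used to cut the word out of the corresponding 44.1 kSa/s audio stream.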

Provider:

Gdańsk University of Technology

Created:

2021-06-22
