MODALITY corpus - SPEAKER 25 - COMMANDS C1
Source: Mendeley Data. Updated: 2024-01-31. Indexed: 2024-06-29.
Download link:
https://mostwiedzy.pl/en/open-research-data/modality-corpus-speaker-25-commands-c1,414075641441498-0
Description:
The MODALITY corpus is a multimodal database of word recordings in English. It consists of over 30 hours of multimodal recordings. The database contains high-resolution, high-framerate stereoscopic video streams and audio signals obtained from a microphone array and a laptop microphone. Because every utterance is labelled, the corpus can be used to develop an audio-visual speech recognition (AVSR) system. Recordings made in noisy conditions can be used to test the robustness of speech recognition systems.

The language material is based on a remote control scenario and includes 231 words: numbers, names of months and days, and a set of verbs and nouns related to computer device control. Speakers read them both as isolated words and as sequences, resulting in a set of 12 recording sessions per speaker. Half of the sessions were recorded in quiet conditions; the other half contained three kinds of intrusive signals (traffic, babble and factory noise).

The corpus includes recordings of 42 speakers (33 male, 9 female). The participants include 20 students and staff of the Multimedia Systems Department of the Gdańsk University of Technology, 5 students of the Institute of English and American Studies of the University of Gdańsk, and 17 native English speakers.

This dataset consists of recordings and visual features for SPEAKER 25:
sex: male
native speaker: yes
age: 58
test material: COMMANDS C1

All recordings for all speakers are available at http://www.modality-corpus.org/

[Image: sample still from the corpus (SPEAKER 25)]

Due to the size of the corpus (approx. 2.5 TB of data), each speaker's recordings were placed in a separate ZIP file of approx. 4 to 7 GB. The recordings are organized according to the speakers' language skills. Group A (17 speakers) consists of native speakers. Recordings of non-native speakers (Polish nationals) were placed in Group B (25 speakers).
The audio files use the Waveform Audio File Format (.wav) and contain a single PCM audio stream sampled at 44.1 kSa/s with 16-bit depth. The video files use the Matroska Multimedia Container Format (.mkv), holding a 1080p video stream captured at 100 fps and compressed with the H.264 codec (High 4:4:4 profile). The '.lab' files are text files that give the positions of words in the audio files and follow the HTK label format. Each line of a '.lab' file contains the label preceded by its start and end times (in 100 ns units), e.g.:

    1239620000 1244790000 FIVE

which denotes the word "five", occurring between 123.962 s and 124.479 s of the audio. Word-accurate SNR values calculated for every recording are also included in the ZIP file.
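The label format described above can be read with a few lines of Python. This is a minimal sketch, not code shipped with the corpus: the names `Label`, `parse_lab_line`, `parse_lab` and `to_sample_range` are hypothetical, and it assumes every non-empty line holds exactly a start time, an end time and a label separated by whitespace. The 44.1 kHz default sample rate is taken from the description of the audio files.

```python
from typing import List, NamedTuple, Tuple

# HTK label times are expressed in units of 100 ns, i.e. 10,000,000 per second.
HTK_UNITS_PER_SECOND = 10_000_000


class Label(NamedTuple):
    start_s: float  # word onset, in seconds
    end_s: float    # word offset, in seconds
    word: str       # label text, e.g. "FIVE"


def parse_lab_line(line: str) -> Label:
    """Parse one '<start> <end> <label>' line (assumed layout)."""
    start, end, word = line.split(maxsplit=2)
    return Label(
        int(start) / HTK_UNITS_PER_SECOND,
        int(end) / HTK_UNITS_PER_SECOND,
        word.strip(),
    )


def parse_lab(text: str) -> List[Label]:
    """Parse the full contents of a .lab file, skipping blank lines."""
    return [parse_lab_line(ln) for ln in text.splitlines() if ln.strip()]


def to_sample_range(label: Label, sample_rate: int = 44_100) -> Tuple[int, int]:
    """Convert a label's time span to (first, last) sample indices."""
    return round(label.start_s * sample_rate), round(label.end_s * sample_rate)


if __name__ == "__main__":
    example = parse_lab_line("1239620000 1244790000 FIVE")
    print(example.word, example.start_s, example.end_s)
    print(to_sample_range(example))
```

The sample indices returned by `to_sample_range` can then be used to slice the word out of the decoded .wav samples, assuming the label file and the audio file share the same time origin.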
Created:
2024-01-31



