PortMedia French and Italian corpus
Source: catalogue.elra.info | Updated: 2014-07-23 | Indexed: 2025-03-22
Download link:
https://catalogue.elra.info/en-us/repository/browse/ELRA-S0371/
Description:
The PortMedia French and Italian corpus was produced by ELDA, with the same paradigm and specifications as the MEDIA speech database (ELRA-S0272) but on a different domain. The method chosen for the corpus construction process is that of a ‘Wizard of Oz’ (WoZ) system. This consists of simulating a natural language man-machine dialogue. The scenario was built in the domain of touristic information and reservation (ticket reservation within the 2010 Festival d’Avignon for French and hotel reservation for Italian). The corpus contains 700 transcribed dialogues from about 140 French speakers and 604 transcribed dialogues from about 150 Italian speakers (several dialogues per speaker).

The database is formatted following the SpeechDat conventions and it includes the following items:
• 700 recorded sessions for French and 604 sessions for Italian. The signals are stored in a stereo wave file format. Each of the two speech channels is recorded at 8 kHz with 16 bit quantization with the least significant byte first (“lohi” or Intel format) as signed integers.
• Manual transcription of each session in HTML format. Label files were created with the free transcription tool Transcriber (TRS files).
• A manual semantic annotation of the corpus. It has been produced with Semantizer, which is also provided with the data.
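The signal format described above (two interleaved channels of little-endian signed 16-bit samples at 8 kHz) can be read with Python's standard library alone. The following is a minimal sketch, assuming the session files carry a standard RIFF/WAVE PCM header; the function name `split_stereo_channels` and any file path passed to it are hypothetical, and the listing does not say which channel holds which side of the dialogue.

```python
import struct
import wave


def split_stereo_channels(path):
    """Return (sample_rate, channel_0, channel_1) for a 16-bit stereo WAV file.

    Hypothetical helper: the name and return shape are illustrative and are
    not part of the corpus distribution.
    """
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 2 channels of 16-bit samples")
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())

    # "<" means little-endian ("lohi" / Intel), "h" means signed 16-bit integer.
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)

    # Stereo frames are interleaved: ch0, ch1, ch0, ch1, ...
    return sample_rate, list(samples[0::2]), list(samples[1::2])
```

For a file matching the description, the returned sample rate should be 8000, and each channel list holds one integer per sample in the range -32768 to 32767.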
Provider:
ELRA Catalogue of Language Resources



