ORTOFON v3: corpus of informal spoken Czech with multi-tier transcription (transcriptions & audio)
Source: hdl.handle.net, indexed 2025-03-26
Download link:
http://hdl.handle.net/11234/1-5686
Description:
ORTOFON v3 is a corpus of authentic spoken Czech as used in informal situations (private settings, spontaneous and unprepared speech, etc.), covering the whole of the Czech Republic. The corpus comprises 697 recordings made in 2012–2020 and contains 2 445 793 orthographic words (2 976 742 tokens in total, including punctuation); 1 121 different speakers appear in the probes. ORTOFON v3 is partially balanced with regard to the basic sociolinguistic speaker categories (gender, age group, level of education and region of childhood residence). The transcription is linked to the corresponding audio track. Unlike in the ORAL-series corpora, the transcription was carried out on two main tiers, orthographic and phonetic, supplemented by an additional metalanguage tier. The (anonymized) transcriptions are provided in the XML-based ELAN annotation format; the audio (with corresponding anonymization beeps) is in uncompressed 16-bit PCM WAV format, mono, 16 kHz. The transcriptions are also available in another format under a less restrictive CC BY-NC-SA license at http://hdl.handle.net/11234/1-5687
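As a rough illustration of the formats named in the description, the following Python sketch checks whether a WAV file matches the stated audio parameters (mono, 16-bit PCM, 16 kHz) and counts annotations per tier in an ELAN file. It assumes the transcriptions follow the standard ELAN XML layout (`TIER` elements carrying a `TIER_ID` attribute and `ANNOTATION` children); the tier names actually used in ORTOFON v3 are not given in this listing, so the code simply reports whatever tiers it finds. The function names are hypothetical and not part of any corpus tooling.

```python
import wave
import xml.etree.ElementTree as ET


def wav_matches_description(path):
    """Return True if the file is mono, 16-bit PCM, sampled at 16 kHz."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1          # mono
            and wav.getsampwidth() == 2      # 2 bytes per sample = 16-bit
            and wav.getframerate() == 16000  # 16 kHz
            and wav.getcomptype() == "NONE"  # uncompressed PCM
        )


def count_annotations_per_tier(eaf_path):
    """Map each tier ID in an ELAN file to its number of annotations.

    Assumes the standard ELAN layout: TIER elements with a TIER_ID
    attribute, each holding ANNOTATION children.
    """
    root = ET.parse(eaf_path).getroot()
    return {
        tier.get("TIER_ID"): len(tier.findall("ANNOTATION"))
        for tier in root.iter("TIER")
    }
```

A call such as `wav_matches_description("recording.wav")` or `count_annotations_per_tier("recording.eaf")` would be applied to files from the downloaded package; both file names here are placeholders, since the listing does not state how the files are named.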
Provider:
hdl.handle.net



