JURISDICT
Source: DataCite Commons. Updated 2026-02-10. Indexed 2026-05-04.
Download link:
https://www.ortolang.fr/market/item/sldr000868/v2
Resource description:
The JURISDICT speech database is a large continuous speech database originally designed for dictated speech recognition. It includes more than 1,500 annotated sessions of speakers from 16 regions of Poland, plus another 500 experimental recordings. The database is intended to provide material for both training and testing of speech dictation of common and legal texts, including isolated word systems, word-spotting systems, and vocabulary-independent systems that use either whole-word or sub-word modeling approaches.
A typical JURISDICT recording session is a mixture of semi-spontaneous speech (controlled dictation) and read/dictated speech. The specification is based on general language features and on the peculiarities of Polish at the different linguistic and phonetic levels. The general assumptions for the structure of the database take into account text features (semantic structure, syntactic factors, grammatical and acoustic-phonetic factors) and speaking style (semi-spontaneous, controlled spontaneous dictation, and elicited dictation, i.e. answering speech).
The starting point for the annotation specification applied to the present corpus was the SpeeCon annotation guidelines (deliverable D214), which are based on orthographic, word-level transcription. In the first step, annotators (a team of students of the Faculty of Modern Languages and Literature in Poznań, more than thirty people over the whole annotation period) manually checked that the recorded text agreed with the input orthographic transcription, inserting the necessary adjustments, special event markers, and time boundaries. The first-step annotations were hand-validated (where necessary) by two expert phoneticians and four experienced labellers, whose inter-labeller agreement was monitored, especially with regard to the number and types of special events, time boundary insertion, and spelling errors.
The inter-labeller agreement on time boundaries was high (above 90%). The agreement on special event labels depended on the type of label: it was best for unintelligible speech markers (above 80%) and filled pause labels (approx. 70%). It was lower for speaker noise labels and mispronunciation markers because of greater variation observed for one of the labellers; after excluding that labeller's results, the agreement was up to 70%.
Provider:
ORTOLANG (Open Resources and TOols for LANGuage) - www.ortolang.fr
Created:
2026-02-10



