Iban Automatic Speech Recognition Resources

Name: Iban Automatic Speech Recognition Resources
Creator: UNIMAS Data for Research
Published: 2025-12-08 23:57:50
License: 暂无描述

DataCite Commons2025-12-08 更新2026-05-07 收录

下载链接：

https://data4research.unimas.my/citation?persistentId=doi:10.60767/FK2/S8ES7Q

下载链接

链接失效反馈

官方服务：

资源简介：

This package has iban text and speech corpora used for Automatic Speech Recognition (ASR) experiments. Data is available in the subdirectories of /data. The subdirectories contain: a. train - train transcript for training ASR system using Kaldi ASR (http://kaldi.sourceforge.net/) b. test - test transcript for testing ASR system (also Kaldi ASR format) c. wav - speech corpus We have provided text corpus and language model in the /LM directory, while, the pronunciation dictionary in /lang directory. ###PUBLICATION ON IBAN DATA AND ASR ##### Details on the corpora and the our experiments on iban ASR can be found in the following list of publication. We appreciate if you cite them if you intend to publish. @inproceedings{Juan14, Author = {Sarah Samson Juan and Laurent Besacier and Solange Rossato}, Booktitle = {Proceedings of Workshop for Spoken Language Technology for Under-resourced (SLTU)}, Month = {May}, Title = {Semi-supervised G2P bootstrapping and its application to ASR for a very under-resourced language: Iban}, Year = {2014}} @inproceedings{Juan2015, Title = {Using Resources from a closely-Related language to develop ASR for a very under-resourced Language: A case study for Iban}, Author = {Sarah Samson Juan and Laurent Besacier and Benjamin Lecouteux and Mohamed Dyab}, Booktitle = {Proceedings of INTERSPEECH}, Year = {2015}, Address = {Dresden, Germany}, Month = {September}} ###IBAN SPEECH CORPUS News data provided by a local radio station in Sarawak, Malaysia. Directory: data/train Files: text (training transcript), wav.scp (file id and path to audio file), utt2spk (file id and audio id), spk2utt(audio id and file id), wav (.wav files). For more information about the format, please refer to Kaldi website http://kaldi.sourceforge.net/data_prep.html Description: training data in Kaldi format about 7 hours. Note: The path of wav files in wav.scp MUST BE MODIFIED to point to the actual location. Directory: data/test Files: text (test transcript), wav.scp (file id and path to audio file), utt2spk (file id and audio id), spk2utt(audio id and file id), wav (.wav files). Description: testing data in Kaldi format about 1 hour. Note: The path of wav files in wav.scp MUST BE MODIFIED to point to the actual location. The audio files have the format: ib[m|f]_SPK_UTT where, m refers to male and f refers to female speaker, SPK denotes speaker id and UTT is the utterance id. #### IBAN TEXT CORPUS Directory: /LM/ Files: iban-bp-2012.txt, iban-lm-o3.arpa # /iban-bp-2012.txt Contains 2 M Words. Full text data crawled from an online newspaper and cleaned as much as we could. # /iban-lm-o3.arpa The language model build on SRILM (http://www.speech.sri.com/projects/srilm/) using iban-bp-2012.txt #### LEXICON/PRONUNCIATION DICTIONARY Directory: /lang Files : lexicon.txt (lexicon), nonsilence_phones.txt (speech phones), optional_silence.txt (silence phone) Description: lexicon contains words and their respective pronunciation, non-speech sound and noise in Kaldi format. Details on the development of the dictionary can be found in our papers. (For this package, we provided the Iban-Hybrid version.) #TO DOWNLOAD THE REPOSITORY You first need to install Subversion (SVN). Then type into a shell: svn co https://github.com/sarahjuan/iban ### SCRIPTS In /kaldi-scripts, you can find all scripts that can be used to train and test models from the existing data and lang directory. Note: Path needs to changed to make it work in your own directory. You can launch run.sh to prepare data & language model, make mfccs and train acoustic models. ### WER RESULTS OBTAINED USING OUR CORPORA AND SETTINGS. RESULTS OBTAINED AFTER UPDATING TEST TRANSCRIPT. THE ONES REPORTED IN OUR PAPERS WERE BEFORE THIS UPDATE## AM WER(%) --------------------------- monophone 41.1 triphone 35.3 triphone+del+del 36.0 tri+LDA+MLLT 25.9 tri+SAT+fMLLR 19.7 SGMM 16.6 DNN 15.8 ##ACKNOWLEDGEMENT ### We would like to thank the Ministry of Higher Education Malaysia for providing financial support to conduct this study. We also thank The Borneo Post news agency for providing online materials for building the text corpus and also to Radio Televisyen Malaysia (RTM), Sarawak, Malaysia, for providing the news data.

提供机构：

UNIMAS Data for Research

创建时间：

2025-02-05

5,000+

优质数据集

54 个

任务类型

进入经典数据集