meisin123/iban_speech_corpus

Name: meisin123/iban_speech_corpus
Creator: meisin123
Published: 2023-11-02 04:39:07
License: No description available

Hugging Face | Updated 2023-11-02 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/meisin123/iban_speech_corpus

Resource description:

---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: transcription
    dtype: string
  splits:
  - name: train
    num_bytes: 1014986154.58
    num_examples: 3132
  download_size: 981436514
  dataset_size: 1014986154.58
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card for "iban_speech_corpus"

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [How to use](#how-to-use)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
- [Dataset Creation](#dataset-creation)
  - [Source Data](#source-data)
- [Additional Information](#additional-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Repository:** The original dataset is available in [Sarah Juan's GitHub repository](https://github.com/sarahjuan/iban)
- **Paper:** "Using Resources from a closely-Related language to develop ASR for a very under-resourced Language: A case study for Iban"

### Dataset Summary

This Iban speech corpus is intended for training an Automatic Speech Recognition (ASR) model. It contains audio files (WAV) paired with their transcriptions. For other resources, such as the pronunciation dictionary and the Iban language model, refer to the original dataset repository [here](https://github.com/sarahjuan/iban).

### How to use

The `datasets` library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded to your local drive and prepared in a single call to the `load_dataset` function.

```python
from datasets import load_dataset

dataset = load_dataset("meisin123/iban_speech_corpus", split="train")
```

## Dataset Structure

### Data Instances

```
{'audio': {'path': 'ibf_001_001.wav',
           'array': array([ 5.72814941e-01,  5.49011230e-01, -1.82495117e-02, ...,
                           -2.31628418e-02, -1.26342773e-02, -3.05175781e-05]),
           'sampling_rate': 16000},
 'transcription': 'pukul sepuluh malam'}
```

### Data Fields

- audio: A dictionary containing the audio filename, the decoded audio array, and the sampling rate.
- transcription: The transcription of the audio file.

## Dataset Creation

- Iban data collected by Sarah Samson Juan and Laurent Besacier

### Source Data

The audio files are news data provided by a local radio station in Sarawak, Malaysia.

## Additional Information

### Citation Information

Details on the corpora and the experiments on Iban ASR can be found in the publications listed below. The original authors would appreciate a citation if you publish work that uses this corpus.

```
@inproceedings{Juan14,
  Author = {Sarah Samson Juan and Laurent Besacier and Solange Rossato},
  Booktitle = {Proceedings of Workshop for Spoken Language Technology for Under-resourced (SLTU)},
  Month = {May},
  Title = {Semi-supervised G2P bootstrapping and its application to ASR for a very under-resourced language: Iban},
  Year = {2014}}

@inproceedings{Juan2015,
  Title = {Using Resources from a closely-Related language to develop ASR for a very under-resourced Language: A case study for Iban},
  Author = {Sarah Samson Juan and Laurent Besacier and Benjamin Lecouteux and Mohamed Dyab},
  Booktitle = {Proceedings of INTERSPEECH},
  Year = {2015},
  Address = {Dresden, Germany},
  Month = {September}}
```

### Contributions

Thanks to [meisin](https://github.com/meisin) for adding this dataset.
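
### Example: inspecting records

The record layout described under Data Instances and Data Fields can be worked with as in the sketch below. It is a minimal illustration, not part of the original dataset card: `clip_duration_seconds` and `summarize` are hypothetical helper names, and the sample record is synthetic (one second of silence), not taken from the corpus. It assumes each row exposes `audio` as a dictionary with `array` and `sampling_rate` keys, as in the instance shown above.

```python
from typing import Any, Dict, Iterable


def clip_duration_seconds(example: Dict[str, Any]) -> float:
    """Return the clip length in seconds: number of samples / sampling rate."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def summarize(examples: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Aggregate clip count, total duration, and whitespace-split word count."""
    count = 0
    total_seconds = 0.0
    total_words = 0
    for example in examples:
        count += 1
        total_seconds += clip_duration_seconds(example)
        total_words += len(example["transcription"].split())
    return {"clips": count, "seconds": total_seconds, "words": total_words}


# Synthetic stand-in for one record (assumption: same keys as the card's example).
sample = {
    "audio": {
        "path": "example.wav",          # hypothetical filename
        "array": [0.0] * 16000,         # one second of silence at 16 kHz
        "sampling_rate": 16000,
    },
    "transcription": "pukul sepuluh malam",
}

stats = summarize([sample])
print(stats)
```

If the rows returned by `load_dataset` decode to the same dictionary shape, the same helpers should accept them unchanged, since only `len()` of the array and the two string keys are used.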

Provider:

meisin123

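Summary of original information

Dataset overview

Dataset name

Iban Speech Corpus

Dataset purpose

Used to train automatic speech recognition (ASR) models.

Dataset contents

Contains audio files (WAV format) and their corresponding transcriptions.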
