meisin123/iban_speech_corpus
Source: Hugging Face. Updated: 2023-11-02. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/meisin123/iban_speech_corpus
Resource description:
---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: transcription
    dtype: string
  splits:
  - name: train
    num_bytes: 1014986154.58
    num_examples: 3132
  download_size: 981436514
  dataset_size: 1014986154.58
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Dataset Card for "iban_speech_corpus"
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [How to use](#how-to-use)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
- [Dataset Creation](#dataset-creation)
  - [Source Data](#source-data)
- [Additional Information](#additional-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Repository:** The original dataset is available in [Sarah Juan's GitHub repository](https://github.com/sarahjuan/iban)
- **Paper:** "Using Resources from a closely-Related language to develop ASR for a very under-resourced Language: A case study for Iban"
### Dataset Summary
This Iban speech corpus is intended for training Automatic Speech Recognition (ASR) models. It contains audio files (WAV) with their corresponding transcriptions; the single `train` split holds 3,132 examples.
For other resources, such as the pronunciation dictionary and the Iban language model, please refer to the original dataset repository [here](https://github.com/sarahjuan/iban).
### How to use
The `datasets` library lets you load and pre-process the dataset in pure Python, at scale. A single call to the `load_dataset` function downloads the dataset to your local drive and prepares it for use.
```python
from datasets import load_dataset
dataset = load_dataset("meisin123/iban_speech_corpus", split="train")
```
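Because the corpus ships with a single `train` split, a held-out set has to be carved out by hand before evaluating an ASR model. The following is a minimal sketch, not part of the original card: the helper name `load_iban_splits` and the default values for `test_size` and `seed` are assumptions chosen for illustration. It relies on `Dataset.train_test_split` from the `datasets` library.

```python
def load_iban_splits(test_size=0.1, seed=42):
    """Load the corpus and split it into train and test portions.

    `test_size` and `seed` are illustrative defaults, not values
    recommended by the dataset authors.
    """
    # Imported lazily so that defining this helper does not require
    # the `datasets` package or a network connection.
    from datasets import load_dataset

    dataset = load_dataset("meisin123/iban_speech_corpus", split="train")
    # Returns a DatasetDict with "train" and "test" keys.
    return dataset.train_test_split(test_size=test_size, seed=seed)
```

Calling `splits = load_iban_splits()` downloads the data on first use; `splits["train"]` and `splits["test"]` can then be passed to a training or evaluation loop. Fixing the seed keeps the partition reproducible across runs.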
## Dataset Structure
### Data Instances
```
{'audio': {'path': 'ibf_001_001.wav',
           'array': array([ 5.72814941e-01,  5.49011230e-01, -1.82495117e-02, ...,
                           -2.31628418e-02, -1.26342773e-02, -3.05175781e-05]),
           'sampling_rate': 16000},
 'transcription': 'pukul sepuluh malam'}
```
### Data Fields
- `audio`: a dictionary containing the audio filename (`path`), the decoded audio array (`array`), and the sampling rate (`sampling_rate`).
- `transcription`: the transcription of the audio file.
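The fields above are enough to derive simple statistics such as clip length. The sketch below assumes each row has the dictionary layout shown under Data Instances; the helper name `clip_duration_seconds` and the synthetic `example` row are hypothetical and stand in for a real row so that the snippet runs without downloading the corpus.

```python
def clip_duration_seconds(sample):
    """Return the clip length in seconds for one dataset row."""
    audio = sample["audio"]
    # Number of samples divided by samples per second.
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in row (not real corpus data): 16,000 zero-valued
# samples at a 16 kHz sampling rate.
example = {
    "audio": {
        "path": "example.wav",
        "array": [0.0] * 16000,
        "sampling_rate": 16000,
    },
    "transcription": "pukul sepuluh malam",
}

print(clip_duration_seconds(example))  # 1.0
```

With the real dataset, the same function can be applied to `dataset[0]` or used to filter out clips that are too long for a given model.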
## Dataset Creation
- The Iban data was collected by Sarah Samson Juan and Laurent Besacier.
### Source Data
The audio files are news data provided by a local radio station in Sarawak, Malaysia.
## Additional Information
### Citation Information
Details on the corpus and the Iban ASR experiments can be found in the publications listed below. The original authors would appreciate a citation if you intend to publish work that uses this data.
```
@inproceedings{Juan14,
Author = {Sarah Samson Juan and Laurent Besacier and Solange Rossato},
Booktitle = {Proceedings of Workshop for Spoken Language Technology for Under-resourced (SLTU)},
Month = {May},
Title = {Semi-supervised G2P bootstrapping and its application to ASR for a very under-resourced language: Iban},
Year = {2014}}
@inproceedings{Juan2015,
Title = {Using Resources from a closely-Related language to develop ASR for a very under-resourced Language: A case study for Iban},
Author = {Sarah Samson Juan and Laurent Besacier and Benjamin Lecouteux and Mohamed Dyab},
Booktitle = {Proceedings of INTERSPEECH},
Year = {2015},
Address = {Dresden, Germany},
Month = {September}}
```
### Contributions
Thanks to [meisin](https://github.com/meisin) for adding this dataset.
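Provider: meisin123
Summary of the original information:
- Dataset name: Iban Speech Corpus
- Purpose: training Automatic Speech Recognition (ASR) models.
- Contents: audio files (WAV format) with their corresponding transcriptions.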



