bookbot/slr72_dataset
Source: Hugging Face. Updated 2025-11-18; indexed 2026-02-07.
Download link: https://hf-mirror.com/datasets/bookbot/slr72_dataset
Resource description:
---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: phonemes_ipa
    sequence: string
  splits:
  - name: train
    num_bytes: 2096285787.1674485
    num_examples: 3922
  - name: test
    num_bytes: 524338693.8325515
    num_examples: 981
  download_size: 2077059933
  dataset_size: 2620624481
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
license: cc-by-sa-4.0
task_categories:
- automatic-speech-recognition
language:
- es
tags:
- speech
- phonemes
pretty_name: Crowdsourced high-quality Colombian Spanish speech dataset.
---
## Dataset Description
### Dataset Summary
This dataset is a modified version of the [Crowdsourced high-quality Colombian Spanish speech dataset](https://www.openslr.org/72/). We added a `phonemes_ipa` column that contains each sentence phonemized into IPA, produced with [babygruut](https://github.com/bookbot-hive/babygruut).
The original dataset, collected by Google, provides 4,903 high-quality utterances (7.58 hours of audio) from Colombian Spanish speakers.
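A minimal sketch of loading one split with the Hugging Face `datasets` library, assuming the package is installed and the Hub is reachable. The helper names `load_split` and `first_transcript` are hypothetical and not part of the dataset; streaming is used here only to avoid fetching the full download up front, and decoding the `audio` column may require additional audio dependencies.

```python
REPO_ID = "bookbot/slr72_dataset"  # repository ID taken from this card


def load_split(split="train", streaming=True):
    """Return one split of the dataset from the Hugging Face Hub.

    Requires the `datasets` package and network access.
    """
    # Imported lazily so this module can be loaded without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=streaming)


def first_transcript(split="test"):
    """Fetch the first example of a split and return (text, speaker_id)."""
    example = next(iter(load_split(split)))
    return example["text"], example["speaker_id"]
```

Calling `first_transcript("test")` would return the transcription and speaker ID of the first test utterance; neither function touches the network until it is called.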
### Languages
Spanish
### Sample Instance
```py
{'audio': {'path': 'com_07508_01944220782.wav',
           'array': array([0.00228882, 0.00231934, 0.00198364, ..., 0.00097656, 0.00088501,
                           0.00097656]),
           'sampling_rate': 48000},
 'text': 'Quiero irme de vacaciones a Hawái pero tengo ésta duda de si el volcán está activo',
 'speaker_id': 7508,
 'phonemes_ipa': ['k ʝ e ɾ o',
                  'i ɾ m e',
                  'd e',
                  'b a k a θ ʝ o n e s',
                  'a',
                  'a ai',
                  'p e ɾ o',
                  't e n g o',
                  'e s t a',
                  'd u d a',
                  'd e',
                  's i',
                  'e l',
                  'b o l k a n',
                  'e s t a',
                  'a k t i b o']}
```
The columns in the dataset are described below:
- `audio`: Contains the audio data including the file path, waveform array, and sampling rate
- `text`: Text transcription of the spoken content in Spanish
- `speaker_id`: Unique numeric identifier for the speaker
- `phonemes_ipa`: Phonetic transcription in International Phonetic Alphabet (IPA) symbols, stored as one string per word with phonemes separated by spaces
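Because `phonemes_ipa` holds one space-separated string per word, it can be paired with the words of `text` or flattened into a single phoneme sequence. The sketch below uses the values from the sample instance; the helper names `align_words` and `flatten_phonemes` are hypothetical, and pairing by whitespace tokens is an assumption that holds for this sample but may not hold for every row.

```python
def align_words(text, phonemes_ipa):
    """Pair whitespace-separated words with their phoneme strings.

    Assumes one phoneme string per whitespace token; raises if the counts differ.
    """
    words = text.split()
    if len(words) != len(phonemes_ipa):
        raise ValueError(f"{len(words)} words vs {len(phonemes_ipa)} phoneme groups")
    return list(zip(words, phonemes_ipa))


def flatten_phonemes(phonemes_ipa):
    """Concatenate the per-word strings into one flat list of phoneme symbols."""
    return [symbol for word in phonemes_ipa for symbol in word.split()]


# Values copied from the sample instance in this card.
sample = {
    "text": "Quiero irme de vacaciones a Hawái pero tengo ésta duda "
            "de si el volcán está activo",
    "phonemes_ipa": [
        "k ʝ e ɾ o", "i ɾ m e", "d e", "b a k a θ ʝ o n e s", "a", "a ai",
        "p e ɾ o", "t e n g o", "e s t a", "d u d a", "d e", "s i", "e l",
        "b o l k a n", "e s t a", "a k t i b o",
    ],
}

pairs = align_words(sample["text"], sample["phonemes_ipa"])
flat = flatten_phonemes(sample["phonemes_ipa"])

print(pairs[3])   # the word "vacaciones" with its phoneme string
print(len(flat))  # total number of phoneme symbols in the utterance
```

Raising on a count mismatch, rather than silently truncating with `zip`, makes it easy to spot rows where the whitespace assumption breaks.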
### Data Splits
| Split | Number of examples |
| :---- | -----------------: |
| Train | 3922 |
| Test | 981 |
**Total**: 4,903 utterances (~7.58 hours of audio)
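The figures above imply roughly an 80/20 train/test partition and a mean utterance length between 5 and 6 seconds. The arithmetic below is a sketch derived only from the numbers in this card; the 7.58-hour total is rounded, so the per-utterance mean is approximate.

```python
TRAIN_EXAMPLES = 3922
TEST_EXAMPLES = 981
TOTAL_HOURS = 7.58  # rounded figure from the card

total = TRAIN_EXAMPLES + TEST_EXAMPLES
train_fraction = TRAIN_EXAMPLES / total
test_fraction = TEST_EXAMPLES / total
mean_seconds = TOTAL_HOURS * 3600 / total  # hours -> seconds, averaged over utterances

print(f"total utterances: {total}")
print(f"train/test: {train_fraction:.1%} / {test_fraction:.1%}")
print(f"mean utterance length: {mean_seconds:.2f} s")
```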
Provider: bookbot



