
bookbot/slr72_dataset

Hugging Face | Updated 2025-11-18 | Indexed 2026-02-07
Download link:
https://hf-mirror.com/datasets/bookbot/slr72_dataset
Resource description:
---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: phonemes_ipa
    sequence: string
  splits:
  - name: train
    num_bytes: 2096285787.1674485
    num_examples: 3922
  - name: test
    num_bytes: 524338693.8325515
    num_examples: 981
  download_size: 2077059933
  dataset_size: 2620624481
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
license: cc-by-sa-4.0
task_categories:
- automatic-speech-recognition
language:
- es
tags:
- speech
- phonemes
pretty_name: Crowdsourced high-quality Colombian Spanish speech dataset.
---

## Dataset Description

### Dataset Summary

This dataset is a modified version of [Crowdsourced high-quality Colombian Spanish speech dataset](https://www.openslr.org/72/). We added a new `phonemes_ipa` column, which contains the sentences phonemized in IPA format. We use [babygruut](https://github.com/bookbot-hive/babygruut) to phonemize the sentences.

This dataset, originally collected by Google, provides 4,903 high-quality utterances, representing 7.58 hours of audio from Colombian Spanish speakers.

### Languages

Spanish

### Sample Instance

```py
{'audio': {'path': 'com_07508_01944220782.wav',
           'array': array([0.00228882, 0.00231934, 0.00198364, ..., 0.00097656, 0.00088501, 0.00097656]),
           'sampling_rate': 48000},
 'text': 'Quiero irme de vacaciones a Hawái pero tengo ésta duda de si el volcán está activo',
 'speaker_id': 7508,
 'phonemes_ipa': ['k ʝ e ɾ o', 'i ɾ m e', 'd e', 'b a k a θ ʝ o n e s', 'a', 'a ai',
                  'p e ɾ o', 't e n g o', 'e s t a', 'd u d a', 'd e', 's i', 'e l',
                  'b o l k a n', 'e s t a', 'a k t i b o']}
```

The columns in the dataset are:

- `audio`: The audio data, including the file path, waveform array, and sampling rate
- `text`: Text transcription of the spoken content in Spanish
- `speaker_id`: Unique numeric identifier for the speaker
- `phonemes_ipa`: Phonetic transcription using International Phonetic Alphabet (IPA) symbols, segmented by word

### Data Splits

| Split | Number of examples |
| :---- | -----------------: |
| Train |               3922 |
| Test  |                981 |

**Total**: 4,903 utterances (~7.58 hours of audio)
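
As a rough illustration of how the `phonemes_ipa` column lines up with `text`, the sketch below pairs each word of the sample instance with its phoneme group and derives a few summary figures from the split table. The helper names (`align_words`, `split_fractions`, `mean_duration_seconds`, `load_split`) are hypothetical and not part of the dataset or any library. The one-entry-per-word alignment holds for the sample shown above but is an assumption for the rest of the data. `load_split` assumes the Hugging Face `datasets` library is installed and the repository is reachable; it is defined but not called here.

```python
# Sample record copied from the dataset card (the audio field is omitted).
sample = {
    "text": "Quiero irme de vacaciones a Hawái pero tengo ésta duda "
            "de si el volcán está activo",
    "speaker_id": 7508,
    "phonemes_ipa": [
        "k ʝ e ɾ o", "i ɾ m e", "d e", "b a k a θ ʝ o n e s", "a", "a ai",
        "p e ɾ o", "t e n g o", "e s t a", "d u d a", "d e", "s i", "e l",
        "b o l k a n", "e s t a", "a k t i b o",
    ],
}

# Figures taken from the "Data Splits" table and the dataset summary.
SPLITS = {"train": 3922, "test": 981}
TOTAL_HOURS = 7.58


def align_words(record):
    """Pair each whitespace-separated word with its list of phonemes.

    Assumption: one `phonemes_ipa` entry per word. This is true for the
    sample record but has not been checked against the full dataset.
    """
    words = record["text"].split()
    groups = record["phonemes_ipa"]
    if len(words) != len(groups):
        raise ValueError(
            f"{len(words)} words but {len(groups)} phoneme groups"
        )
    return [(word, group.split()) for word, group in zip(words, groups)]


def split_fractions(splits):
    """Return each split's share of the total number of examples."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


def mean_duration_seconds(splits, total_hours):
    """Estimate the average utterance length from the reported totals."""
    return total_hours * 3600 / sum(splits.values())


def load_split(split="train"):
    """Load one split from the Hub (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily; optional dependency
    return load_dataset("bookbot/slr72_dataset", split=split)
```

Calling `align_words(sample)` yields sixteen `(word, phonemes)` pairs, one per word of the sentence. `split_fractions(SPLITS)` shows that the train split holds roughly four fifths of the examples, and `mean_duration_seconds(SPLITS, TOTAL_HOURS)` gives an average utterance length of a little over five seconds, based only on the totals reported above.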

Provider:
bookbot