bookbot/slr72_dataset

Name: bookbot/slr72_dataset
Creator: bookbot
Published: 2025-11-18 01:10:24
License: No description available

Hugging Face · Updated 2025-11-18 · Indexed 2026-02-07

Download link:

https://hf-mirror.com/datasets/bookbot/slr72_dataset


Resource description:

---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: phonemes_ipa
    sequence: string
  splits:
  - name: train
    num_bytes: 2096285787.1674485
    num_examples: 3922
  - name: test
    num_bytes: 524338693.8325515
    num_examples: 981
  download_size: 2077059933
  dataset_size: 2620624481
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
license: cc-by-sa-4.0
task_categories:
- automatic-speech-recognition
language:
- es
tags:
- speech
- phonemes
pretty_name: Crowdsourced high-quality Colombian Spanish speech dataset.
---

## Dataset Description

### Dataset Summary

This dataset is a modified version of [Crowdsourced high-quality Colombian Spanish speech dataset](https://www.openslr.org/72/). We added a new `phonemes_ipa` column, which contains phonemized sentences in the IPA format. We use [babygruut](https://github.com/bookbot-hive/babygruut) to phonemize the sentences. This dataset, originally collected by Google, provides 4,903 high-quality utterances, representing 7.58 hours of audio specifically from Colombian Spanish speakers.
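As a quick sanity check on the figures above, the following sketch recomputes the total from the split sizes and derives an approximate mean utterance length. The function name `split_stats` is hypothetical and not part of the dataset or any library. The 7.58 hours figure is the approximate total quoted in the summary, so the derived duration is only an estimate.

```python
# Figures copied from the dataset card's front matter and summary.
TRAIN_EXAMPLES = 3922
TEST_EXAMPLES = 981
TOTAL_HOURS = 7.58  # approximate total audio duration stated in the summary


def split_stats(train: int, test: int, hours: float) -> dict:
    """Derive simple statistics from split sizes and total audio hours.

    Hypothetical helper for illustration only.
    """
    total = train + test
    return {
        "total_examples": total,
        "train_fraction": train / total,
        # Estimate only: assumes the stated hours cover both splits.
        "mean_seconds_per_utterance": hours * 3600 / total,
    }


stats = split_stats(TRAIN_EXAMPLES, TEST_EXAMPLES, TOTAL_HOURS)
for key, value in stats.items():
    print(f"{key}: {value}")
```

The split sizes add up to the 4,903 utterances quoted in the summary, and the train split holds roughly four fifths of the data.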
### Languages

Spanish

### Sample Instance

```py
{
    'audio': {
        'path': 'com_07508_01944220782.wav',
        'array': array([0.00228882, 0.00231934, 0.00198364, ..., 0.00097656, 0.00088501, 0.00097656]),
        'sampling_rate': 48000
    },
    'text': 'Quiero irme de vacaciones a Hawái pero tengo ésta duda de si el volcán está activo',
    'speaker_id': 7508,
    'phonemes_ipa': ['k ʝ e ɾ o', 'i ɾ m e', 'd e', 'b a k a θ ʝ o n e s', 'a', 'a ai', 'p e ɾ o', 't e n g o', 'e s t a', 'd u d a', 'd e', 's i', 'e l', 'b o l k a n', 'e s t a', 'a k t i b o']
}
```

The following describes each column in the dataset:

- `audio`: Contains the audio data, including the file path, waveform array, and sampling rate
- `text`: Text transcription of the spoken content in Spanish
- `speaker_id`: Unique numeric identifier for the speaker
- `phonemes_ipa`: Phonetic transcription using International Phonetic Alphabet (IPA) symbols, segmented by words

### Data Splits

| Split | Number of examples |
| :---- | -----------------: |
| Train | 3922 |
| Test | 981 |

**Total**: 4,903 utterances (~7.58 hours of audio)
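Because each entry of `phonemes_ipa` holds one word's phonemes separated by spaces, a flat phoneme sequence can be obtained by splitting each entry. The sketch below is one way to do this, using the sample instance above with the audio field omitted. The helper `flatten_phonemes` and the `|` word-boundary token are illustrative assumptions, not part of the dataset or of babygruut.

```python
from collections import Counter

# Sample row copied from the dataset card (audio omitted for brevity).
sample = {
    "text": "Quiero irme de vacaciones a Hawái pero tengo ésta duda de si el volcán está activo",
    "speaker_id": 7508,
    "phonemes_ipa": [
        "k ʝ e ɾ o", "i ɾ m e", "d e", "b a k a θ ʝ o n e s", "a", "a ai",
        "p e ɾ o", "t e n g o", "e s t a", "d u d a", "d e", "s i", "e l",
        "b o l k a n", "e s t a", "a k t i b o",
    ],
}


def flatten_phonemes(words, boundary=None):
    """Flatten word-level phoneme strings into a single list of phonemes.

    Hypothetical helper: if `boundary` is given, that token is inserted
    between consecutive words so word boundaries remain recoverable.
    """
    flat = []
    for index, word in enumerate(words):
        if boundary is not None and index > 0:
            flat.append(boundary)
        flat.extend(word.split())
    return flat


phonemes = flatten_phonemes(sample["phonemes_ipa"])
with_boundaries = flatten_phonemes(sample["phonemes_ipa"], boundary="|")
counts = Counter(phonemes)

print(len(phonemes), "phonemes in", len(sample["phonemes_ipa"]), "words")
print(counts.most_common(5))
```

In this sample, the number of `phonemes_ipa` entries matches the number of whitespace-separated words in `text`, which makes word-level alignment straightforward. Whether that holds for every row is not stated on the card and would need to be checked.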


Provided by:

bookbot
