hey-buddy

ModelScope community; updated 2025-12-05; indexed 2025-03-22
Download link:
https://modelscope.cn/datasets/benjamin-paine/hey-buddy
Resource description:
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64429aaf7feb866811b12f73/MPNTk4yaeh-shgLCv4tXg.png" width=768 height=768 />
</div>

# Precalculated Datasets

You do *not* need to download these datasets manually if you are using `heybuddy`; the command-line trainer downloads them automatically. However, if you wish to make your own datasets or want to deploy **heybuddy** in a pre-configured manner, links are provided on this page.

Precalculated datasets have the shape `(n, 17, 96)`. The first `16` columns along `axis=1` hold the speech embeddings of the audio data, and the last column is the tokenized transcription, zero-padded or truncated to match the embedding length (`96`).

The tokenized transcription should not be fed to the model during training; instead, use it to filter out training audio that may contain your wake phrase. This filtering improves the final model's recall by up to 50%, depending on how common your phrase is.

## Training

Note that this training data is downcast to `float16`. This reduces its accuracy slightly, but cuts the large file size in half.
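The `(n, 17, 96)` layout and the transcription-based filtering described above can be sketched with plain NumPy. This is a minimal illustration, not code from `heybuddy`: the helper names (`split_samples`, `contains_phrase`, `drop_wake_phrase`), the tiny synthetic array, and the token IDs are all hypothetical, because the tokenizer that produced the transcription row is not specified here. It also assumes the wake phrase's token IDs appear as one contiguous run in that row.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def split_samples(data):
    """Split an (n, 17, 96) array into embeddings (n, 16, 96) and tokens (n, 96)."""
    embeddings = data[:, :16, :]
    # The transcription row shares the array's float dtype, so round
    # before casting back to integer token IDs.
    tokens = np.rint(data[:, 16, :]).astype(np.int64)
    return embeddings, tokens


def contains_phrase(tokens, phrase_ids):
    """Boolean mask of rows whose token sequence contains phrase_ids contiguously."""
    phrase = np.asarray(phrase_ids, dtype=np.int64)
    # windows has shape (n, 96 - k + 1, k) for a phrase of k tokens.
    windows = sliding_window_view(tokens, phrase.size, axis=1)
    return (windows == phrase).all(axis=2).any(axis=1)


def drop_wake_phrase(data, phrase_ids):
    """Return float32 embeddings for samples that do NOT contain the phrase."""
    embeddings, tokens = split_samples(data)
    keep = ~contains_phrase(tokens, phrase_ids)
    # Only the embeddings are returned; the transcription row is never
    # passed on to training.
    return embeddings[keep].astype(np.float32)


# Synthetic stand-in for a precalculated file (4 samples, float16).
rng = np.random.default_rng(0)
data = rng.standard_normal((4, 17, 96)).astype(np.float16)
data[:, 16, :] = 0                  # zero-padded transcription rows
data[0, 16, :3] = [101, 7, 9]       # hypothetical token IDs
data[2, 16, :4] = [5, 101, 7, 9]

wake_phrase_ids = [7, 9]            # hypothetical IDs for the wake phrase
filtered = drop_wake_phrase(data, wake_phrase_ids)
print(filtered.shape)               # (2, 16, 96)
```

Samples 0 and 2 contain the hypothetical phrase and are dropped, so two samples remain. For the real multi-gigabyte files, `np.load(path, mmap_mode="r")` avoids reading the whole array into memory before slicing.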
### Metadata

| | Combined | Part 1 | Part 2 |
| -- | -- | -- | -- |
| Download | N/A | [Download Part 1](https://huggingface.co/benjamin-paine/world-wide-web-wake-word/resolve/main/precalculated/training-1.npy) | [Download Part 2](https://huggingface.co/benjamin-paine/world-wide-web-wake-word/resolve/main/precalculated/training-2.npy) |
| Size | `72 GB` | `46 GB` | `25 GB` |
| Hours | ~6500 | ~4200 | ~2300 |
| Shape | `(23341584, 17, 96)` | `(15012254, 17, 96)` | `(8329330, 17, 96)` |
| Type | `float16` | | |
| License | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) | | |

### Constituent Datasets

| Dataset | Hours | License |
| ------- | ----- | ------- |
| [parler-tts/mls_eng:train](https://huggingface.co/datasets/parler-tts/mls_eng/viewer/default/train) | ~2500 hours | CC-BY 4.0 |
| [mozilla-foundation/common_voice_17_0:en:train](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/en/train) | ~1000 hours | CC0 1.0 |
| [homebrewltd/instruction-speech-encodec-v1](https://huggingface.co/datasets/homebrewltd/instruction-speech-encodec-v1) | ~650 hours | MIT |
| [mozilla-foundation/common_voice_17_0:de:train](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/de/train) | ~500 hours | CC0 1.0 |
| [mozilla-foundation/common_voice_17_0:fr:train](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/fr/train) | ~475 hours | CC0 1.0 |
| [MushanW/GLOBE:train](https://huggingface.co/datasets/MushanW/GLOBE) | ~350 hours | CC0 1.0 |
| [mozilla-foundation/common_voice_17_0:es:train](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/es/train) | ~275 hours | CC0 1.0 |
| [facebook/voxpopuli:en:train](https://huggingface.co/datasets/facebook/voxpopuli/viewer/en/train) | ~200 hours | CC0 1.0 |
| [mozilla-foundation/common_voice_17_0:eo:train](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/eo/train) | ~150 hours | CC0 1.0 |
| [benjamin-paine/freesound-laion-640k:train](https://huggingface.co/datasets/benjamin-paine/freesound-laion-640k) | ~125 hours | CC0 1.0, CC-BY 4.0, CC-BY 3.0, CC-Sampling+ *(excluded CC-BY-NC samples)* |
| [benjamin-paine/dinner-party-corpus:split-channel:train](https://huggingface.co/datasets/benjamin-paine/dinner-party-corpus/viewer/split-channel) | ~75 hours | CDLA-Permissive 1.0 |
| [mozilla-foundation/common_voice_17_0:sw:train](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/sw/train) | ~50 hours | CC0 1.0 |
| [mozilla-foundation/common_voice_17_0:zh-CN:train](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/zh-CN/train) | ~25 hours | CC0 1.0 |
| [mozilla-foundation/common_voice_17_0:ar:train](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/ar/train) | ~20 hours | CC0 1.0 |
| [google/fleurs:en_us:train](https://huggingface.co/datasets/google/fleurs/viewer/en_us/train) | ~5 hours | CC-BY 4.0 |
| [mozilla-foundation/common_voice_17_0:hi:train](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/hi/train) | ~5 hours | CC0 1.0 |

## Validation

We do **not** downcast the validation data set in the hopes of encouraging accurate validations.
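To get a feel for what downcasting costs, the sketch below round-trips synthetic `float32` values through `float16` and reports the worst-case error and the storage ratio. The data is random and purely illustrative; it is not drawn from the actual datasets, and real embedding values may have a different range.

```python
import numpy as np

# Synthetic stand-in for float32 speech embeddings.
rng = np.random.default_rng(0)
original = rng.standard_normal((1000, 16, 96)).astype(np.float32)

# Downcast to float16 (as done for the training data), then restore.
downcast = original.astype(np.float16)
restored = downcast.astype(np.float32)

max_abs_error = float(np.abs(original - restored).max())
size_ratio = downcast.nbytes / original.nbytes

print(f"max absolute error: {max_abs_error:.5f}")
print(f"size ratio: {size_ratio}")  # 0.5
```

The error stays small for values of this magnitude, while the array occupies exactly half the bytes, which is the trade-off accepted for the training data but not for validation.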
### Metadata

| | |
| -- | -- |
| Download | [Download](https://huggingface.co/benjamin-paine/world-wide-web-wake-word/resolve/main/precalculated/validation.npy) |
| Size | `238 MB` |
| Hours | ~35 |
| Shape | `(63100, 17, 96)` |
| Type | `float32` |
| License | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) |

### Constituent Datasets

| Dataset | Hours | License |
| ------- | ----- | ------- |
| [benjamin-paine/dinner-party-corpus:mixed-channel:test](https://huggingface.co/datasets/benjamin-paine/dinner-party-corpus/viewer/mixed-channel/test) | ~10 hours | CDLA-Permissive 1.0 |
| [parler-tts/mls_eng:test](https://huggingface.co/datasets/parler-tts/mls_eng/viewer/default/test) | ~5 hours | CC-BY 4.0 |
| [mozilla-foundation/common_voice_17_0:en:validation](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/en/validation) | ~5 hours | CC0 1.0 |
| [facebook/voxpopuli:en:validation](https://huggingface.co/datasets/facebook/voxpopuli/viewer/en/validation) | ~5 hours | CC0 1.0 |
| [google/fleurs:en_us:validation](https://huggingface.co/datasets/google/fleurs/viewer/en_us/validation) | ~5 hours | CC-BY 4.0 |
| [gpt-omni/VoiceAssistant-400K:train](https://huggingface.co/datasets/gpt-omni/VoiceAssistant-400K/viewer/default/train) | ~5 hours | Apache 2.0 |

## Creating your Own

The precalculation code is provided so that you can create your own training and validation datasets, if you wish.

The general command is as follows, where `$NAME` is the name you want to give to the dataset (for example, `my-validation`), and `$REPO_ID` is the path to the Hugging Face repository in the form `username/repository`.

```sh
heybuddy extract $NAME $REPO_ID
heybuddy combine $NAME --delete
```

### Extended Options

```sh
Usage: heybuddy extract [OPTIONS] NAME REPO_ID

  Creates a dataset of speech embeddings from a given repository.

Options:
  --config TEXT                   The configuration name to create the dataset from (when multiple configs are supported.)
  --split TEXT                    Split to create the dataset from. [default: train]
  --audio-key TEXT                Key in the dataset for the audio data. [default: audio]
  --audio-array-key TEXT          Key in the audio data for the waveform. [default: array]
  --audio-sample-rate-key TEXT    Key in the audio data for the sample rate. [default: sampling_rate]
  --transcript-key TEXT           Key in the dataset for the transcript data. [default: transcript]
  --streaming                     Stream the dataset, instead of downloading first. [default: True]
  --hours FLOAT                   Hours of audio to process. [default: 1000.0]
  --samples-per-file INTEGER      Number of samples per file. [default: 10000]
  --device-id INTEGER             Device ID to use for processing. None uses CPU.
  --sample-rate INTEGER           Sample rate to resample audio to. [default: 16000]
  --seconds-per-batch FLOAT       Seconds of audio to process per batch. [default: 1.56]
  --process-batch-size INTEGER    Batch size for processing audio files. [default: 100]
  --embedding-batch-size INTEGER  Batch size for extracting embeddings. [default: 32]
  --tokenizer-max-length INTEGER  Maximum length for the tokenizer. [default: 96]
  --help                          Show this message and exit.
```

The resulting `.npy` file will be saved in `heybuddy`'s `precalculated` directory by default, and can be passed to the `train` command with `--training-dataset <file>`.

# Citations

```
@article{Pratap2020MLSAL,
  title={MLS: A Large-Scale Multilingual Dataset for Speech Research},
  author={Vineel Pratap and Qiantong Xu and Anuroop Sriram and Gabriel Synnaeve and Ronan Collobert},
  journal={ArXiv},
  year={2020},
  volume={abs/2012.03411}
}
```

```
@inproceedings{commonvoice:2020,
  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages = {4211--4215},
  year = 2020
}
```

```
@misc{wang2024globe,
  title={GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech},
  author={Wenbin Wang and Yang Song and Sanjay Jha},
  year={2024},
  eprint={2406.14875},
  archivePrefix={arXiv},
}
```

```
@article{InstructionSpeech2024,
  title={Instruction Speech},
  author={JanAI},
  year={2024},
  month={June},
  url={https://huggingface.co/datasets/jan-hq/instruction-speech}
}
```

```
@inproceedings{wang-etal-2021-voxpopuli,
  title = "{V}ox{P}opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation",
  author = "Wang, Changhan and Riviere, Morgane and Lee, Ann and Wu, Anne and Talnikar, Chaitanya and Haziza, Daniel and Williamson, Mary and Pino, Juan and Dupoux, Emmanuel",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
  month = aug,
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-long.80",
  pages = "993--1003",
}
```

```
@article{fleurs2022arxiv,
  title = {FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  author = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  journal={arXiv preprint arXiv:2205.12446},
  url = {https://arxiv.org/abs/2205.12446},
  year = {2022},
}
```

```
@misc{vansegbroeck2019dipcodinnerparty,
  title={DiPCo -- Dinner Party Corpus},
  author={Maarten Van Segbroeck and Ahmed Zaid and Ksenia Kutsenko and Cirenia Huerta and Tinh Nguyen and Xuewen Luo and Björn Hoffmeister and Jan Trmal and Maurizio Omologo and Roland Maas},
  year={2019},
  eprint={1909.13447},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/1909.13447},
}
```

Provider:
maas
Created:
2025-03-18