
yjhuang01/Hokchia

Source: Hugging Face. Updated 2024-03-17. Indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/yjhuang01/Hokchia
Resource description:
---
language:
- zh
license:
- mit
multilinguality:
- monolingual
size_categories:
- n<1K
source_datasets:
- original
task_categories:
- automatic-speech-recognition
---

# Hokchia Audio Dataset

Hokchia, or the Fuqing dialect, is a branch of Eastern Min Chinese spoken mainly in Fuqing City, Fujian province, China. Unlike Hokkien, which is more widely recognized and spoken in various parts of Southeast Asia, Hokchia is used primarily within the Fuqing community and its diaspora. Its pronunciation, vocabulary, and grammatical structures differ from those of other Min Chinese varieties. The Hokchia audio dataset is designed to support speech recognition technologies for this dialect, with the aim of improving digital inclusivity for Hokchia speakers worldwide.

## Dataset Description

The Hokchia Audio Dataset is a collection of audio recordings in the Hokchia language, accompanied by transcriptions. It is intended for use with speech recognition models, particularly for fine-tuning models like Whisper on Hokchia. The dataset covers a wide range of spoken content, making it suitable for applications that require speech-to-text capabilities in Hokchia.

## Content

Each audio file in the dataset is named following the pattern `Hokchia_X.wav`, where `X` is a numerical identifier. Each audio file has a corresponding JSON line in the `whisper_finetune_input.jsonl` file that provides the text transcription of the audio content.

The dataset structure is as follows:

- `README.md`: This file.
- `dataset/`: Directory containing audio files split into subdirectories by language.
  - `Hokchia/`: Subdirectory containing Hokchia audio files.
- `whisper_finetune_input.jsonl`: JSON Lines file containing mappings of audio file paths to their text transcriptions.
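As a minimal sketch of working with the `Hokchia_X.wav` naming pattern described above, the following standard-library Python snippet extracts the numeric identifier and sorts file names by it. The helper names `clip_id` and `sort_clips` and the example file names are hypothetical and are not part of the dataset.

```python
import re
from typing import List, Optional

# Pattern taken from the README: Hokchia_X.wav, where X is a numerical identifier.
FILENAME_RE = re.compile(r"^Hokchia_(\d+)\.wav$")


def clip_id(filename: str) -> Optional[int]:
    """Return the numeric identifier X from a name like Hokchia_X.wav, or None."""
    match = FILENAME_RE.match(filename)
    return int(match.group(1)) if match else None


def sort_clips(filenames: List[str]) -> List[str]:
    """Keep only names that follow the pattern and sort them by numeric ID."""
    valid = [name for name in filenames if clip_id(name) is not None]
    return sorted(valid, key=clip_id)


# Hypothetical file names, for illustration only.
example = ["Hokchia_10.wav", "Hokchia_2.wav", "notes.txt", "Hokchia_1.wav"]
print(sort_clips(example))
```

Sorting on the parsed integer rather than on the raw string keeps `Hokchia_10.wav` after `Hokchia_2.wav`, which a plain lexicographic sort would not.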
## Dataset Structure

Each line in the `whisper_finetune_input.jsonl` file represents a data point in the following format:

```json
{"audio_filepath": "./dataset/Hokchia/Hokchia_1.wav", "text": "text transcription here"}
```

- `audio_filepath`: Relative path to the audio file.
- `text`: Transcription of the audio in Hokchia.

## Use Cases

This dataset is particularly suited for:

- Training and fine-tuning speech recognition models on the Hokchia language.
- Linguistic studies focusing on the Hokchia dialect.
- Developing voice-activated applications that require understanding of Hokchia.

## How to Use

You can load this dataset using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Repository ID as written in the original README; this page lists the
# dataset under yjhuang01/Hokchia.
dataset = load_dataset("AnnoFichel/hokchia_audio_dataset")
```

## Acknowledgements

This dataset was collected and prepared by Jack Huang. We acknowledge the contributions of the speakers who participated in the recording sessions and the individuals who provided transcriptions.
Provider:
yjhuang01
Summary of original information

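Hokchia Audio Dataset overview

Basic information

  • Language: Chinese (Hokchia dialect)
  • License: MIT
  • Multilinguality: monolingual
  • Dataset size: fewer than 1,000 entries
  • Data source: original
  • Task category: automatic speech recognition

Dataset description

The Hokchia Audio Dataset contains audio recordings in the Hokchia language together with their transcriptions. It is designed for speech recognition models, in particular for fine-tuning models such as Whisper on Hokchia. The dataset covers varied spoken content and is suitable for applications that need Hokchia speech-to-text capabilities.

Content structure

  • Audio files: named Hokchia_X.wav, where X is a numerical identifier.
  • Transcriptions: each audio file has a corresponding JSON line in whisper_finetune_input.jsonl that provides the text transcription of the audio content.
  • File structure:
    • README.md: this file.
    • dataset/: directory containing audio files, split into subdirectories by language.
      • Hokchia/: subdirectory containing the Hokchia audio files.
    • whisper_finetune_input.jsonl: JSON Lines file mapping audio file paths to their text transcriptions.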

Data point format

Each data point in whisper_finetune_input.jsonl has the following format:

    {"audio_filepath": "./dataset/Hokchia/Hokchia_1.wav", "text": "text transcription here"}

  • audio_filepath: relative path to the audio file.
  • text: transcription of the audio in Hokchia.
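
The data point format above can be read with the Python standard library alone. The sketch below parses JSON Lines records and checks that the two documented fields are present. The function name `parse_manifest` is hypothetical, and the sample line is the placeholder from the format description, not real dataset content.

```python
import json
from typing import Dict, Iterable, List

# The two fields documented for each line of whisper_finetune_input.jsonl.
REQUIRED_KEYS = ("audio_filepath", "text")


def parse_manifest(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse JSON Lines text into records, skipping blank lines."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing keys {missing}")
        records.append(record)
    return records


# Placeholder line copied from the format description above.
sample = [
    '{"audio_filepath": "./dataset/Hokchia/Hokchia_1.wav", "text": "text transcription here"}'
]
records = parse_manifest(sample)
print(records[0]["audio_filepath"])
```

Because an open file object is itself an iterable of lines, the same function can be called as `parse_manifest(open("whisper_finetune_input.jsonl", encoding="utf-8"))`, assuming the file sits in the current directory.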

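Use cases

  • Training and fine-tuning speech recognition models for the Hokchia language.
  • Linguistic research focused on the Hokchia dialect.
  • Developing voice-activated applications that need to understand Hokchia.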
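
For the training and fine-tuning use case, one common preparatory step is to hold out part of the records for validation. The sketch below shows a seeded, reproducible split using only the standard library. The source does not prescribe any split; the function name `split_records`, the default fraction, and the seed are all assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_records(
    records: Sequence[T], validation_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle records with a fixed seed and return (train, validation)."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one record whenever there are two or more.
    n_val = max(1, int(len(shuffled) * validation_fraction)) if len(shuffled) > 1 else 0
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical manifest entries, for illustration only.
entries = [f"./dataset/Hokchia/Hokchia_{i}.wav" for i in range(1, 21)]
train, validation = split_records(entries, validation_fraction=0.25, seed=0)
print(len(train), len(validation))
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.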
The content above was collected and summarized by 遇见数据集.