zoengjyutgaai
Source: ModelScope community (魔搭社区). Updated 2025-11-05; indexed 2025-03-15.
Download link:
https://modelscope.cn/datasets/pengzhendong/zoengjyutgaai
# 張悦楷講古語音數據集
[English](#the-zoeng-jyut-gaai-story-telling-speech-dataset)
## Dataset Description
- **Homepage:** [張悦楷講古語音數據集 The Zoeng Jyut Gaai Story-telling Speech Dataset](https://canclid.github.io/zoengjyutgaai/)
- **License:** [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/)
- **Language:** Cantonese
- **Total Duration:** 112.54 hours
- **Average Clip Duration:** 5.901 seconds
- **Median Clip Duration:** 5.443 seconds
- **Total number of characters:** 1679097
- **Average characters per clip:** 24.36
- **Median characters per clip:** 23
- **Average speech speed:** 4.14 characters per second
- **Voice Actor:** [張悦楷](https://zh.wikipedia.org/wiki/%E5%BC%A0%E6%82%A6%E6%A5%B7)
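As a quick sanity check, the figures listed above can be cross-checked against each other. The sketch below hard-codes the published numbers; the variable names are illustrative, and the derived clip counts are only estimates because the published averages are rounded.

```python
# Published figures from the dataset description above.
total_hours = 112.54
avg_clip_seconds = 5.901
total_chars = 1_679_097
avg_chars_per_clip = 24.36

total_seconds = total_hours * 3600

# Average speech speed: characters divided by total audio time.
chars_per_second = total_chars / total_seconds

# Two independent estimates of the number of clips; they differ slightly
# because the published averages are rounded.
clips_from_duration = total_seconds / avg_clip_seconds
clips_from_chars = total_chars / avg_chars_per_clip

print(f"speech speed: {chars_per_second:.2f} characters per second")
print(f"estimated clips (from duration): {clips_from_duration:,.0f}")
print(f"estimated clips (from characters): {clips_from_chars:,.0f}")
```

Both estimates land at roughly 69,000 clips, and the computed speed rounds to the published 4.14 characters per second.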
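This is a speech dataset of 張悦楷 (Zoeng Jyut Gaai) telling _Romance of the Three Kingdoms_, _Water Margin_ and _The Final Days of Mao Zedong_ (《走進毛澤東的最後歲月》). [張悦楷](https://zh.wikipedia.org/wiki/%E5%BC%A0%E6%82%A6%E6%A5%B7) is Guangzhou's best-known 講古佬 (Cantonese storyteller). From the 1970s onward he told stories on radio stations across Guangdong, and his voice is a shared memory for many people in Guangzhou. This dataset collects his three best-known works.
Use cases:
- TTS (text-to-speech) training set
- ASR (automatic speech recognition) training or test set
- Linguistic and literary research of all kinds
- Just listening to enjoy the art!
TTS demo: https://huggingface.co/spaces/laubonghaudoi/zoengjyutgaai_tts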
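## Notes
- All text follows the orthography prescribed in https://jyutping.org/blog/typo/ and https://jyutping.org/blog/particles/.
- All text uses full-width punctuation; there is no half-width punctuation.
- All text is transcribed in Chinese characters, with no Arabic numerals or Latin letters.
- All source audio is stored in `source/`. For convenient use as training data, the segmented audio is stored in `opus/`.
- All opus audio has a 48000 Hz sampling rate.
- All source SRT subtitle files are stored under `srt/`. Paired with the audio under `source/`, they can be played back directly as subtitled recordings.
- `cut.py` is the segmentation script: it cuts each source audio file into short sentences according to its srt and generates a csv of transcriptions.
- `stats.py` is the statistics script: running it prints the statistics for the whole dataset.
## Download and usage
To download and use this dataset, run the following in Python: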
```python
from datasets import load_dataset
ds = load_dataset("CanCLID/zoengjyutgaai")
```
If you just want to download everything under `opus/`, run the Python code below. Install the dependency first with `pip install --upgrade huggingface_hub`:
```python
from huggingface_hub import snapshot_download
# If you only want to download subtitles or source audio, change `opus/*` to `srt/*` or `source/*`
snapshot_download(
    repo_id="CanCLID/zoengjyutgaai",
    allow_patterns="opus/*",
    local_dir="./",
    repo_type="dataset",
)
```
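If you would rather not use Python, you can also use git on the command line to fetch only `opus/` or another directory, which avoids spending disk space and download time on cloning the whole repo: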
```bash
mkdir zoengjyutgaai
cd zoengjyutgaai
git init
git remote add origin https://huggingface.co/datasets/CanCLID/zoengjyutgaai
git sparse-checkout init --cone
# Tell git which directory to download
git sparse-checkout set opus
# Start downloading
git pull origin main
```
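### Dataset construction pipeline
The dataset was collected and built as follows:
1. Download the source recordings from YouTube or storytelling (評書) websites in mainland China; each episode is typically a half-hour `.webm` or `.mp3` file.
1. Subtitle the recordings with a subtitling tool to obtain the corresponding `.srt` files.
1. Convert the source recordings to `.opus` with as little additional compression as possible, using the ffmpeg commands in the audio format conversion section below.
1. Run `cut.py` to cut each episode's `.opus` into one `.opus` per sentence at the timestamps in the `.srt`, and write the corresponding text to this dataset's `xxx.csv`.
1. Then open IPython and run the commands below one at a time to push the data to HuggingFace.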
```python
from datasets import load_dataset, DatasetDict
from huggingface_hub import login
sg = load_dataset('audiofolder', data_dir='./opus/saamgwokjinji')
sw = load_dataset('audiofolder', data_dir='./opus/seoiwuzyun')
mzd = load_dataset('audiofolder', data_dir='./opus/mouzaakdung')
dataset = DatasetDict({
    "saamgwokjinji": sg["train"],
    "seoiwuzyun": sw["train"],
    "mouzaakdung": mzd["train"],
})
# Check that the loaded data looks right
dataset['mouzaakdung'][0]
# Have your token ready to log in
login()
# Push to HuggingFace datasets
dataset.push_to_hub("CanCLID/zoengjyutgaai")
```
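The `audiofolder` loader used above picks up transcriptions from a `metadata.csv` placed alongside the audio files, keyed by a `file_name` column. How `cut.py` actually names and lays out its csv is not shown here, so the following is only a sketch: the `transcription` column name and the helper name `write_metadata` are assumptions for illustration.

```python
import csv
from pathlib import Path


def write_metadata(folder: str, rows: list[tuple[str, str]]) -> Path:
    """Write (relative audio path, transcription) pairs to folder/metadata.csv."""
    path = Path(folder) / "metadata.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # `file_name` is the column audiofolder uses to find each audio file;
        # `transcription` is an assumed name for the text column.
        writer.writerow(["file_name", "transcription"])
        writer.writerows(rows)
    return path
```

Calling `write_metadata("opus/saamgwokjinji", rows)` with one row per clip would then let `load_dataset('audiofolder', ...)` attach the text to each audio example.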
### Audio format conversion
First install [ffmpeg](https://www.ffmpeg.org/download.html), then run:
```bash
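# Convert a downloaded webm source into opus (stream copy; assumes the webm audio is already Opus-encoded)
ffmpeg -i webm/saamgwokjinji/001.webm -c:a copy source/saamgwokjinji/001.opus
# Or convert an mp3 source into opus
ffmpeg -i mp3/mouzaakdung/001.mp3 -c:a libopus -map_metadata -1 -b:a 48k -vbr on source/mouzaakdung/001.opus
# Convert opus into uncompressed wav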
ffmpeg -i source/saamgwokjinji/001.opus wav/saamgwokjinji/001.wav
```
To convert all the opus files to wav, run `to_wav.sh`:
```bash
chmod +x to_wav.sh
./to_wav.sh
```
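This generates a `wav/` directory containing the audio converted from `opus/`. Note that wav files are very large: converting all of `opus/` takes at least 500GB of storage, so make sure you have enough space before converting. If you want to resample the audio, you can also modify the command in `to_wav.sh` to resample during the conversion.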
# The Zoeng Jyut Gaai Story-telling Speech Dataset
This is a speech dataset of Zoeng Jyut Gaai performing _Romance of the Three Kingdoms_, _Water Margin_ and _The Final Days of Mao Zedong_ as story-telling. [Zoeng Jyut Gaai](https://zh.wikipedia.org/wiki/%E5%BC%A0%E6%82%A6%E6%A5%B7) is a famous actor, stand-up comedian and story-teller (講古佬) of 20th-century Canton. His voice remains in the memories of thousands of Cantonese people. This dataset is built from three of his best-known story-telling pieces.
Use cases of this dataset:
- TTS (Text-To-Speech) training set
- ASR (Automatic Speech Recognition) training or eval set
- Linguistic and literary research of all kinds
- Just listen and enjoy the art piece!
TTS demo: https://huggingface.co/spaces/laubonghaudoi/zoengjyutgaai_tts
## Introduction
- All transcriptions follow the prescribed orthography detailed in https://jyutping.org/blog/typo/ and https://jyutping.org/blog/particles/.
- All transcriptions use full-width punctuation; no half-width punctuation is used.
- All transcriptions are in Chinese characters, with no Arabic numerals or Latin letters.
- All source audio is stored in `source/`. For the convenience of training, segmented audio is stored in `opus/`.
- All opus audio has a 48000 Hz sampling rate.
- All source subtitle SRT files are stored in `srt/`. Use them with the audio in `source/` to enjoy subtitled storytelling pieces.
- `cut.py` is the script that cuts the opus audio into sentences based on the srt and generates a csv file of transcriptions.
- `stats.py` is the script for getting stats of this dataset.
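The repo's `cut.py` is not reproduced here. As a rough illustration of the idea, the sketch below parses standard SRT blocks (an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then text) and builds one ffmpeg command per subtitle. The names `Cue`, `parse_srt` and `ffmpeg_cut_command`, the sample text and the output file names are all assumptions, not the project's actual code.

```python
import re
from typing import Iterator, NamedTuple


class Cue(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(srt_text: str) -> Iterator[Cue]:
    """Yield one Cue per subtitle block in SRT-formatted text."""
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = _TIMING.search(line)
            if match:
                g = match.groups()
                # Everything after the timing line is the subtitle text.
                text = "".join(part.strip() for part in lines[i + 1:])
                yield Cue(_seconds(*g[:4]), _seconds(*g[4:]), text)
                break


def ffmpeg_cut_command(source: str, cue: Cue, target: str) -> list[str]:
    """Build an ffmpeg command that re-encodes one cue into its own opus file."""
    return [
        "ffmpeg", "-i", source,
        "-ss", f"{cue.start:.3f}", "-to", f"{cue.end:.3f}",
        "-c:a", "libopus", target,
    ]


sample = """1
00:00:01,000 --> 00:00:03,500
第一句

2
00:00:03,500 --> 00:00:07,250
第二句
"""

for n, cue in enumerate(parse_srt(sample), start=1):
    print(cue.text, ffmpeg_cut_command("source/saamgwokjinji/001.opus", cue, f"{n:03d}.opus"))
```

The commands are only built and printed here; passing each list to `subprocess.run` would perform the actual cut, and the collected `cue.text` values are what would go into the transcription csv.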
## Usage
To use this dataset, simply run in Python:
```python
from datasets import load_dataset
ds = load_dataset("CanCLID/zoengjyutgaai")
```
If you only want to download a certain directory, saving the time and space of cloning the entire repo, run the Python code below. Install the dependency first with `pip install --upgrade huggingface_hub`:
```python
from huggingface_hub import snapshot_download
# If you only want to download subtitles or source audio, change `opus/*` to `srt/*` or `source/*`
snapshot_download(
    repo_id="CanCLID/zoengjyutgaai",
    allow_patterns="opus/*",
    local_dir="./",
    repo_type="dataset",
)
```
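`allow_patterns` accepts a single glob pattern or a list of them. To my understanding `huggingface_hub` matches these with `fnmatch`-style rules, under which `*` also matches `/`, so `opus/*` covers files in nested subdirectories too. The sketch below illustrates that matching behaviour with the standard library only; the repository paths and the `select` helper are made up for the example.

```python
from fnmatch import fnmatch

# Hypothetical repository paths, for illustration only.
paths = [
    "opus/saamgwokjinji/001/001_001.opus",
    "srt/saamgwokjinji/001.srt",
    "source/saamgwokjinji/001.opus",
    "README.md",
]


def select(paths: list[str], patterns: list[str]) -> list[str]:
    """Keep the paths that match at least one glob pattern."""
    return [p for p in paths if any(fnmatch(p, pat) for pat in patterns)]


print(select(paths, ["opus/*"]))
print(select(paths, ["srt/*", "source/*"]))
```

In the same way, passing `allow_patterns=["srt/*", "source/*"]` to `snapshot_download` should fetch subtitles and source audio in one call.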
If you would rather use the command line than Python, you can selectively clone a single directory of the repo:
```bash
mkdir zoengjyutgaai
cd zoengjyutgaai
git init
git remote add origin https://huggingface.co/datasets/CanCLID/zoengjyutgaai
git sparse-checkout init --cone
# Tell git which directory you want
git sparse-checkout set opus
# Pull the content
git pull origin main
```
### Audio format conversion
Install [ffmpeg](https://www.ffmpeg.org/download.html) first, then run:
```bash
# Convert a downloaded webm source into opus (stream copy; assumes the webm audio is already Opus-encoded)
ffmpeg -i webm/saamgwokjinji/001.webm -c:a copy source/saamgwokjinji/001.opus
# Or convert an mp3 source into opus
ffmpeg -i mp3/mouzaakdung/001.mp3 -c:a libopus -map_metadata -1 -b:a 48k -vbr on source/mouzaakdung/001.opus
# Convert opus into uncompressed wav
ffmpeg -i source/saamgwokjinji/001.opus wav/saamgwokjinji/001.wav
```
If you want to convert all opus to wav, run `to_wav.sh`:
```bash
chmod +x to_wav.sh
./to_wav.sh
```
This generates a `wav/` directory containing all the audio converted from `opus/`. Be aware that the wav format is very space-consuming: a full conversion takes up at least 500GB, so make sure you have enough storage. If you want to resample the audio, modify the command in `to_wav.sh` to resample during the conversion.
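`to_wav.sh` itself is not reproduced here. As a rough Python equivalent, the sketch below walks `opus/` and builds one ffmpeg command per file, optionally adding ffmpeg's `-ar` option to resample during conversion. The function names and default directory names are assumptions for illustration.

```python
import subprocess
from pathlib import Path
from typing import Optional


def wav_command(src: Path, opus_root: Path, wav_root: Path,
                sample_rate: Optional[int] = None) -> list[str]:
    """Build the ffmpeg command that converts one opus file to wav."""
    dst = (wav_root / src.relative_to(opus_root)).with_suffix(".wav")
    cmd = ["ffmpeg", "-n", "-i", str(src)]  # -n: never overwrite existing output
    if sample_rate is not None:
        cmd += ["-ar", str(sample_rate)]  # resample while converting
    return cmd + [str(dst)]


def convert_all(opus_root: str = "opus", wav_root: str = "wav",
                sample_rate: Optional[int] = None) -> None:
    """Convert every .opus file under opus_root, mirroring the directory tree."""
    for src in sorted(Path(opus_root).rglob("*.opus")):
        cmd = wav_command(src, Path(opus_root), Path(wav_root), sample_rate)
        Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(cmd, check=True)
```

For example, `convert_all(sample_rate=16000)` would write 16 kHz WAV files under `wav/`; a lower sample rate also reduces the disk space the converted files need.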
Provider: maas
Created: 2025-03-12