Greek Podcast Corpus (GPC)
Collection of the Greek Podcast Corpus
File structure
.
├── download_asr.sh                  # download audio from RSS feeds
├── download_individual_rss_list.sh
├── download_tts.sh
├── README.md                        # this file
├── kaldi_utils                      # audio segmentation utilities
├── rss-lists                        # downloaded RSS feeds, included for reproducibility
│   ├── asr
│   │   ├── Arts.txt
│   │   ├── Business.txt
│   │   ├── Comedy.txt
│   │   ├── Education.txt
│   │   ├── Government.txt
│   │   ├── HealthFitness.txt
│   │   ├── History.txt
│   │   ├── KidsFamily.txt
│   │   ├── Leisure.txt
│   │   ├── Music.txt
│   │   ├── News.txt
│   │   ├── Science.txt
│   │   ├── SocietyCulture.txt
│   │   ├── Sports.txt
│   │   ├── Technology.txt
│   │   ├── TrueCrime.txt
│   │   └── TVFilm.txt
│   └── tts
│       ├── audiobooks.txt
│       └── political.txt
├── scrape_rss                       # crawler for collecting new RSS feeds
└── scripts                          # data creation and preprocessing scripts
    ├── create_subset.py
    ├── get_subset.py
    ├── hf_data_gen.py
    ├── sample.py
    ├── to_kaldi.py
    ├── train_dev_test_split.py
    └── transcribe.sh
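Collecting RSS feeds
The rss-lists folder contains the RSS feeds we collected, organized by task (asr and tts). Inside the asr folder, the feeds are grouped by domain.
We also include a scrapy crawler, located in the scrape_rss folder, so that you can collect more RSS feeds.
Run:
cd scrape_rss
scrapy crawl parss -o output.json -a lang=el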
Downloading audio from the RSS feeds
Run:
bash download_asr.sh
Data preparation scripts
Step 1: Get a random subset for each domain (50 hours)
mkdir -p gpc-50
python scripts/sample.py --input_folder $(pwd)/gpc --output_folder $(pwd)/gpc-50 --hours 50
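The sampling idea behind Step 1 can be sketched as follows. This is a minimal illustration and not the contents of scripts/sample.py: it assumes the duration of every episode (in seconds) is already known, shuffles the episodes with a fixed seed, and keeps adding them while the hour budget allows. The function name `sample_by_hours` and its arguments are hypothetical.

```python
import random


def sample_by_hours(durations, hours, seed=0):
    """Randomly pick episodes whose total length stays within `hours`.

    `durations` maps an episode path to its length in seconds
    (hypothetical input; the real script may measure audio differently).
    """
    budget = hours * 3600.0
    paths = sorted(durations)            # fixed order before shuffling, for reproducibility
    random.Random(seed).shuffle(paths)

    picked, total = [], 0.0
    for path in paths:
        if total >= budget:
            break
        # Skip an episode that would overshoot the budget; a shorter one may still fit.
        if total + durations[path] > budget:
            continue
        picked.append(path)
        total += durations[path]
    return picked, total
```

Calling `sample_by_hours(durations, hours=50)` once per domain would yield one subset per domain; whether the 50-hour budget applies per domain or overall is decided by the actual script.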
Step 2: Transcribe the podcasts
bash scripts/transcribe.sh gpc-50
Step 3: Create the train-dev-test split
python scripts/train_dev_test_split.py --input_folder gpc-50 --output_folder gpc-50-all --dev_hours 0.3 --test_hours 1 --rename_sha --shuffle
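A split driven by hour targets, as the --dev_hours and --test_hours flags suggest, might look like the sketch below. It is an assumption-laden illustration, not the logic of scripts/train_dev_test_split.py: items are shuffled, dev is filled first, then test, and everything left over becomes train. The function `split_by_hours` and the duration mapping are hypothetical, and the sketch does not model --rename_sha.

```python
import random


def split_by_hours(durations, dev_hours, test_hours, shuffle=True, seed=0):
    """Carve dev and test sets of roughly the requested size; the rest is train.

    `durations` maps an item name to its length in seconds (hypothetical input).
    """
    items = sorted(durations)
    if shuffle:
        random.Random(seed).shuffle(items)

    splits = {"dev": [], "test": [], "train": []}
    filled = {"dev": 0.0, "test": 0.0}
    targets = {"dev": dev_hours * 3600.0, "test": test_hours * 3600.0}

    for item in items:
        for name in ("dev", "test"):
            # Fill dev first, then test, stopping each once its target is met.
            if filled[name] < targets[name]:
                splits[name].append(item)
                filled[name] += durations[item]
                break
        else:
            # Neither dev nor test needed this item, so it goes to train.
            splits["train"].append(item)
    return splits
```

Because whole items are assigned, dev and test can exceed their targets by up to one item's duration.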
Step 4: Create the subsets
mkdir gpc-50-all/gpc-20-train
python scripts/get_subset.py --input_folder gpc-50-all/train --output_folder gpc-50-all/gpc-20-train/ --hours 20
mkdir gpc-50-all/gpc-10-train
python scripts/get_subset.py --input_folder gpc-50-all/train --output_folder gpc-50-all/gpc-10-train/ --hours 10
mkdir gpc-50-all/gpc-5-train
python scripts/get_subset.py --input_folder gpc-50-all/train --output_folder gpc-50-all/gpc-5-train/ --hours 5
mkdir gpc-50-all/gpc-2-train
python scripts/get_subset.py --input_folder gpc-50-all/train --output_folder gpc-50-all/gpc-2-train/ --hours 2
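The four invocations in Step 4 differ only in the number of hours, so they can be generated in a loop. The sketch below only builds and prints the commands, using the paths and flags shown above; the helper `subset_command` is a hypothetical name introduced for illustration.

```python
import shlex


def subset_command(hours, root="gpc-50-all"):
    """Return the mkdir and get_subset.py commands for one training subset."""
    out_dir = f"{root}/gpc-{hours}-train"
    mkdir = ["mkdir", out_dir]
    subset = [
        "python", "scripts/get_subset.py",
        "--input_folder", f"{root}/train",
        "--output_folder", f"{out_dir}/",
        "--hours", str(hours),
    ]
    return mkdir, subset


# Print the commands for the 20-, 10-, 5- and 2-hour subsets.
for hours in (20, 10, 5, 2):
    for cmd in subset_command(hours):
        print(shlex.join(cmd))
```

Passing each argument list to `subprocess.run(cmd, check=True)` from the repository root would execute it instead of printing it.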
Step 5: Convert to kaldi format
python scripts/to_kaldi.py gpc-50-all/train gpc-50-all/train_kaldi
python scripts/to_kaldi.py gpc-50-all/test gpc-50-all/test_kaldi
python scripts/to_kaldi.py gpc-50-all/dev gpc-50-all/dev_kaldi
python scripts/to_kaldi.py gpc-50-all/gpc-20-train gpc-50-all/gpc20_train_kaldi
python scripts/to_kaldi.py gpc-50-all/gpc-10-train gpc-50-all/gpc10_train_kaldi
python scripts/to_kaldi.py gpc-50-all/gpc-5-train gpc-50-all/gpc5_train_kaldi
python scripts/to_kaldi.py gpc-50-all/gpc-2-train gpc-50-all/gpc2_train_kaldi
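For orientation, a Kaldi data directory is a folder of plain-text index files. The sketch below writes four common ones (wav.scp, segments, text, utt2spk) from a list of utterance records. It illustrates the target format only: the record layout and the function `write_kaldi_dir` are hypothetical, and scripts/to_kaldi.py may produce a different set of files.

```python
from pathlib import Path


def write_kaldi_dir(utterances, out_dir):
    """Write wav.scp, segments, text and utt2spk for a list of utterance dicts.

    Each dict (hypothetical layout) has the keys:
    utt_id, rec_id, wav, start, end, text, speaker.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Kaldi expects every file to be sorted by its first column.
    utts = sorted(utterances, key=lambda u: u["utt_id"])
    recordings = sorted({(u["rec_id"], u["wav"]) for u in utts})

    files = {
        # <recording-id> <path to audio>
        "wav.scp": [f"{rec} {wav}" for rec, wav in recordings],
        # <utterance-id> <recording-id> <start sec> <end sec>
        "segments": [
            f"{u['utt_id']} {u['rec_id']} {u['start']:.2f} {u['end']:.2f}"
            for u in utts
        ],
        # <utterance-id> <transcript>
        "text": [f"{u['utt_id']} {u['text']}" for u in utts],
        # <utterance-id> <speaker-id>
        "utt2spk": [f"{u['utt_id']} {u['speaker']}" for u in utts],
    }
    for name, lines in files.items():
        (out / name).write_text(
            "".join(line + "\n" for line in lines), encoding="utf-8"
        )
```

The segments file is what lets a later step cut each long recording into utterance-level audio.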
Step 6: Extract audio segments (requires a working Kaldi installation: export KALDI_PATH=/path/to/kaldi)
cd kaldi_utils
bash extract_wav_segments_data_dir_eager.sh ../gpc-50-all/train_kaldi ../gpc-50-all/train_kaldi_segmented ../gpc-50-all/train_segmented
bash extract_wav_segments_data_dir_eager.sh ../gpc-50-all/test_kaldi ../gpc-50-all/test_kaldi_segmented ../gpc-50-all/test_segmented
bash extract_wav_segments_data_dir_eager.sh ../gpc-50-all/dev_kaldi ../gpc-50-all/dev_kaldi_segmented ../gpc-50-all/dev_segmented
bash extract_wav_segments_data_dir_eager.sh ../gpc-50-all/gpc20_train_kaldi ../gpc-50-all/gpc20_train_kaldi_segmented ../gpc-50-all/gpc20_train_segmented
bash extract_wav_segments_data_dir_eager.sh ../gpc-50-all/gpc10_train_kaldi ../gpc-50-all/gpc10_train_kaldi_segmented ../gpc-50-all/gpc10_train_segmented
bash extract_wav_segments_data_dir_eager.sh ../gpc-50-all/gpc5_train_kaldi ../gpc-50-all/gpc5_train_kaldi_segmented ../gpc-50-all/gpc5_train_segmented
bash extract_wav_segments_data_dir_eager.sh ../gpc-50-all/gpc2_train_kaldi ../gpc-50-all/gpc2_train_kaldi_segmented ../gpc-50-all/gpc2_train_segmented
Step 7: Convert to huggingface format
python scripts/hf_data_gen.py
Training whisper models
Choose a model and a training subset, set the dataset path, and then run:
export MODEL=small # or medium
export TRAINING_SUBSET=gpc50 # or gpc2, gpc5, gpc10, gpc20
export DATASET_PATH=$(pwd)/greek_podcast_dataset
cd training-scripts
bash ft_whisper_${TRAINING_SUBSET}_${MODEL}.sh
Evaluating the models on the test sets
For common voice and fleurs
export CHECKPOINT_STEPS=3000 # latest checkpoint
cd training-scripts
python decode_whisper_cv.py --processor ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf --model ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf/checkpoint-${CHECKPOINT_STEPS} --text-key sentence --dataset mozilla-foundation/common_voice_11_0 --lang el
python decode_whisper_cv.py --processor ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf --model ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf/checkpoint-${CHECKPOINT_STEPS} --text-key transcription --dataset google/fleurs --lang el
For hparl and logotypografia (assuming you have downloaded both datasets and converted them to huggingface format in ./hparl-test-hf and ./logotypografia-test-hf)
export CHECKPOINT_STEPS=3000 # latest checkpoint
cd training-scripts
python decode_whisper_hplg.py --processor ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf --model ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf/checkpoint-${CHECKPOINT_STEPS} --text-key transcription --dataset ./hparl-test-hf --lang el
python decode_whisper_hplg.py --processor ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf --model ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf/checkpoint-${CHECKPOINT_STEPS} --text-key transcription --dataset ./logotypografia-test-hf --lang el
For the Greek podcast dataset
export CHECKPOINT_STEPS=3000 # latest checkpoint
cd training-scripts
python decode_whisper_podcast.py --processor ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf --model ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf/checkpoint-${CHECKPOINT_STEPS} --text-key transcription --dataset ../greek_podcast_dataset/test --lang el
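The text does not say which metric the decode scripts report. Assuming it is word error rate (WER), the usual metric for speech recognition, the computation can be sketched in plain Python as below. This is an illustration of the metric only, not code from the training-scripts folder, and it applies no text normalization (casing, punctuation, accents), which real evaluations often do.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, a three-word reference transcribed with one word missing has a WER of 1/3.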

1. The Greek podcast corpus: Competitive speech models for low-resourced languages with weakly supervised data. Institute for Language and Speech Processing, Athena Research Center, 2024.



