Greek Podcast Corpus (GPC)

Name: Greek Podcast Corpus (GPC)
Creator: Institute for Language and Speech Processing (ILSP), Athena Research Center
Published: 2024-06-22 00:28:47
License: not specified

Source: arXiv; updated 2024-06-22; indexed 2024-06-25

Download link:

https://github.com/georgepar/greek_podcasts_asr

Overview:

The Greek Podcast Corpus (GPC) is a large, multi-domain corpus of Modern Greek podcasts created by the Institute for Language and Speech Processing (ILSP) of the Athena Research Center. It addresses the scarcity of training data that low-resource languages face in automatic speech recognition (ASR). The corpus contains 3124 hours of audio spanning 16 domains, segmented and transcribed with the WhisperX pipeline and the Whisper large-v3 model. To build it, a web crawler collected RSS feeds, and the audio they reference was downloaded and converted to WAV. The dataset is used mainly for ASR, where weakly supervised training on it improves performance, with the gains growing as model size and the amount of training data increase.

Original information summary

Greek Podcast Corpus

File structure

.
├── download_asr.sh                   # Download audio from RSS feeds
├── download_individual_rss_list.sh
├── download_tts.sh
├── README.md                         # This file
├── kaldi_utils                       # Audio segmentation utilities
├── rss-lists                         # Downloaded RSS feeds, included for reproducibility
│   ├── asr
│   │   ├── Arts.txt
│   │   ├── Business.txt
│   │   ├── Comedy.txt
│   │   ├── Education.txt
│   │   ├── Government.txt
│   │   ├── HealthFitness.txt
│   │   ├── History.txt
│   │   ├── KidsFamily.txt
│   │   ├── Leisure.txt
│   │   ├── Music.txt
│   │   ├── News.txt
│   │   ├── Science.txt
│   │   ├── SocietyCulture.txt
│   │   ├── Sports.txt
│   │   ├── Technology.txt
│   │   ├── TrueCrime.txt
│   │   └── TVFilm.txt
│   └── tts
│       ├── audiobooks.txt
│       └── political.txt
├── scrape_rss                        # Crawler for downloading new RSS feeds
└── scripts                           # Dataset creation and preprocessing scripts
    ├── create_subset.py
    ├── get_subset.py
    ├── hf_data_gen.py
    ├── sample.py
    ├── to_kaldi.py
    ├── train_dev_test_split.py
    └── transcribe.sh

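Collecting RSS feeds

The rss-lists folder contains the RSS feeds we collected, organized by task (asr and tts). Within the asr folder, the feeds are split by domain.

We also include a scrapy crawler in the scrape_rss folder, so that you can collect more RSS feeds.

Run:

cd scrape_rss
scrapy crawl parss -o output.json -a lang=el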

Downloading audio from RSS feeds

Run:

download_asr.sh
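
To make the download step more concrete, here is a minimal sketch of how audio URLs can be read out of a feed, assuming each feed is a standard RSS 2.0 document whose episodes carry their audio in an `<enclosure>` element. The function name `enclosure_urls` and the sample feed are illustrative only; they are not taken from the repository, and `download_asr.sh` may work differently.

```python
import xml.etree.ElementTree as ET


def enclosure_urls(rss_xml: str) -> list[str]:
    """Return the audio URLs found in the <enclosure> tags of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    urls = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue  # episode without an attached file
        url = enclosure.get("url")
        # Keep only audio enclosures (the type is e.g. "audio/mpeg").
        if url and enclosure.get("type", "").startswith("audio/"):
            urls.append(url)
    return urls


# Hypothetical feed used only for illustration.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example podcast</title>
  <item><title>Ep. 1</title>
    <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg" length="123"/></item>
  <item><title>Ep. 2 (no audio)</title></item>
</channel></rss>"""

print(enclosure_urls(SAMPLE_FEED))
```

Each URL found this way would then be downloaded and converted to WAV, as the overview describes.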

Data preparation scripts

Step 1: Get a random subset of each domain (50 hours)

mkdir -p gpc-50; python scripts/sample.py --input_folder $(pwd)/gpc --output_folder $(pwd)/gpc-50 --hours 50
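
The idea of sampling up to a fixed number of hours can be sketched as follows. This is an assumption about the general technique, not a description of `scripts/sample.py`, whose internals are not shown here; the function `sample_hours` and its `durations` argument (file name to length in seconds) are hypothetical.

```python
import random


def sample_hours(durations: dict[str, float], hours: float, seed: int = 0) -> list[str]:
    """Visit files in random order, keeping each one that still fits in the budget.

    `durations` maps a file name to its length in seconds.
    """
    budget = hours * 3600.0
    names = sorted(durations)           # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(names)
    picked, total = [], 0.0
    for name in names:
        if total + durations[name] > budget:
            continue                    # this file would overshoot the budget
        picked.append(name)
        total += durations[name]
    return picked


# Four half-hour episodes and a one-hour budget: exactly two are kept.
episodes = {f"ep{i}.wav": 1800.0 for i in range(4)}
subset = sample_hours(episodes, hours=1)
print(len(subset))
```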

Step 2: Transcribe the podcasts

bash scripts/transcribe.sh gpc-50

Step 3: Create the train/dev/test split

python scripts/train_dev_test_split.py --input_folder gpc-50 --output_folder gpc-50-all --dev_hours 0.3 --test_hours 1 --rename_sha --shuffle
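
A split with hour budgets for the dev and test sets, as the `--dev_hours` and `--test_hours` flags suggest, could look roughly like the sketch below. It assumes whole files are assigned to a split after shuffling; the function `split_by_hours` is hypothetical, and the behaviour of `--rename_sha` is not modelled because the source does not describe it.

```python
import random


def split_by_hours(durations, dev_hours, test_hours, seed=0):
    """Fill dev and test up to their hour budgets; everything else goes to train.

    `durations` maps a file name to its length in seconds.
    """
    names = sorted(durations)
    random.Random(seed).shuffle(names)
    limits = {"dev": dev_hours * 3600.0, "test": test_hours * 3600.0}
    used = {"dev": 0.0, "test": 0.0}
    splits = {"train": [], "dev": [], "test": []}
    for name in names:
        for part in ("dev", "test"):
            if used[part] + durations[name] <= limits[part]:
                splits[part].append(name)
                used[part] += durations[name]
                break
        else:
            # Neither held-out split has room left for this file.
            splits["train"].append(name)
    return splits


# Twenty 10-minute files with the budgets used in the command above.
files = {f"ep{i}.wav": 600.0 for i in range(20)}
splits = split_by_hours(files, dev_hours=0.3, test_hours=1)
print({part: len(names) for part, names in splits.items()})
```

Assigning whole files keeps every segment of an episode in a single split, which avoids leaking material from one episode into both training and evaluation.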

Step 4: Create subsets

mkdir gpc-50-all/gpc-20-train; python scripts/get_subset.py --input_folder gpc-50-all/train --output_folder gpc-50-all/gpc-20-train/ --hours 20
mkdir gpc-50-all/gpc-10-train; python scripts/get_subset.py --input_folder gpc-50-all/train --output_folder gpc-50-all/gpc-10-train/ --hours 10
mkdir gpc-50-all/gpc-5-train; python scripts/get_subset.py --input_folder gpc-50-all/train --output_folder gpc-50-all/gpc-5-train/ --hours 5
mkdir gpc-50-all/gpc-2-train; python scripts/get_subset.py --input_folder gpc-50-all/train --output_folder gpc-50-all/gpc-2-train/ --hours 2

Step 5: Convert to Kaldi format

python scripts/to_kaldi.py gpc-50-all/train gpc-50-all/train_kaldi
python scripts/to_kaldi.py gpc-50-all/test gpc-50-all/test_kaldi
python scripts/to_kaldi.py gpc-50-all/dev gpc-50-all/dev_kaldi
python scripts/to_kaldi.py gpc-50-all/gpc-20-train gpc-50-all/gpc20_train_kaldi
python scripts/to_kaldi.py gpc-50-all/gpc-10-train gpc-50-all/gpc10_train_kaldi
python scripts/to_kaldi.py gpc-50-all/gpc-5-train gpc-50-all/gpc5_train_kaldi
python scripts/to_kaldi.py gpc-50-all/gpc-2-train gpc-50-all/gpc2_train_kaldi
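
For readers unfamiliar with Kaldi, a data directory is a set of plain-text files, typically `wav.scp`, `segments`, `text` and `utt2spk`. The sketch below writes these four files from a list of segment records. It only illustrates the format: the record layout, the utterance-id scheme and the function `write_kaldi_dir` are assumptions, and `scripts/to_kaldi.py` may produce something different.

```python
from pathlib import Path


def write_kaldi_dir(segments, out_dir):
    """Write wav.scp, segments, text and utt2spk for a list of segment records.

    Each record is a dict with keys: rec_id, wav_path, start, end, text.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    wav_scp, seg_lines, text_lines, utt2spk_lines = set(), [], [], []
    for seg in segments:
        rec = seg["rec_id"]
        # Zero-padded centisecond offsets keep utterance ids sortable.
        utt = f"{rec}-{round(seg['start'] * 100):07d}-{round(seg['end'] * 100):07d}"
        wav_scp.add(f"{rec} {seg['wav_path']}")
        seg_lines.append(f"{utt} {rec} {seg['start']:.2f} {seg['end']:.2f}")
        text_lines.append(f"{utt} {seg['text']}")
        # No speaker labels are assumed, so the recording id stands in as speaker id.
        utt2spk_lines.append(f"{utt} {rec}")
    files = {
        "wav.scp": wav_scp,
        "segments": seg_lines,
        "text": text_lines,
        "utt2spk": utt2spk_lines,
    }
    for name, lines in files.items():
        # Kaldi expects every file to be sorted by its first field.
        body = "".join(line + "\n" for line in sorted(lines))
        (out / name).write_text(body, encoding="utf-8")
```

A call such as `write_kaldi_dir([{"rec_id": "ep1", "wav_path": "/data/ep1.wav", "start": 0.0, "end": 29.5, "text": "καλημέρα"}], "demo_kaldi")` would create the four files in a `demo_kaldi` folder.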

Step 6: Extract audio segments (requires a working Kaldi installation -> `export KALDI_PATH=/path/to/kaldi`)

cd kaldi_utils
bash extract_wav_segments_data_dir_eager.sh ../gpc-50-all/train_kaldi ../gpc-50-all/train_kaldi_segmented ../gpc-50-all/train_segmented
bash extract_wav_segments_data_dir_eager.sh ../gpc-50-all/test_kaldi ../gpc-50-all/test_kaldi_segmented ../gpc-50-all/test_segmented
bash extract_wav_segments_data_dir_eager.sh ../gpc-50-all/dev_kaldi ../gpc-50-all/dev_kaldi_segmented ../gpc-50-all/dev_segmented
bash extract_wav_segments_data_dir_eager.sh ../gpc-50-all/gpc20_train_kaldi ../gpc-50-all/gpc20_train_kaldi_segmented ../gpc-50-all/gpc20_train_segmented
bash extract_wav_segments_data_dir_eager.sh ../gpc-50-all/gpc10_train_kaldi ../gpc-50-all/gpc10_train_kaldi_segmented ../gpc-50-all/gpc10_train_segmented
bash extract_wav_segments_data_dir_eager.sh ../gpc-50-all/gpc5_train_kaldi ../gpc-50-all/gpc5_train_kaldi_segmented ../gpc-50-all/gpc5_train_segmented
bash extract_wav_segments_data_dir_eager.sh ../gpc-50-all/gpc2_train_kaldi ../gpc-50-all/gpc2_train_kaldi_segmented ../gpc-50-all/gpc2_train_segmented
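
Conceptually, segment extraction cuts each long recording into one short file per entry of the `segments` file. The sketch below shows that operation for a single segment using only Python's standard `wave` module. It is an illustration of the idea and not a replacement for the Kaldi-based script; the function `extract_segment` is hypothetical and handles uncompressed PCM WAV only.

```python
import wave


def extract_segment(src_path: str, dst_path: str, start: float, end: float) -> None:
    """Copy the audio between `start` and `end` (in seconds) into a new WAV file."""
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        first = int(round(start * rate))
        # Clamp the end so a segment cannot run past the recording.
        last = min(int(round(end * rate)), src.getnframes())
        src.setpos(first)
        frames = src.readframes(max(last - first, 0))
        params = src.getparams()
    with wave.open(dst_path, "wb") as dst:
        # Channels, sample width and rate are copied from the source;
        # the frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
```

For example, `extract_segment("episode.wav", "episode_0001.wav", 12.0, 41.5)` would write the 29.5 seconds starting at 12.0 s to a new file (both file names are placeholders).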

Step 7: Convert to Hugging Face format

python scripts/hf_data_gen.py

Training Whisper models

Choose the model and the training subset, set the dataset path, and then run:

export MODEL=small  # or medium
export TRAINING_SUBSET=gpc50  # or gpc2, gpc5, gpc10, gpc20
export DATASET_PATH=$(pwd)/greek_podcast_dataset

cd training-scripts
bash ft_whisper_${TRAINING_SUBSET}_${MODEL}.sh

Evaluating models on the test sets

For Common Voice and FLEURS

export CHECKPOINT_STEPS=3000  # latest checkpoint
cd training-scripts
python decode_whisper_cv.py --processor ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf --model ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf/checkpoint-${CHECKPOINT_STEPS} --text-key sentence --dataset mozilla-foundation/common_voice_11_0 --lang el
python decode_whisper_cv.py --processor ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf --model ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf/checkpoint-${CHECKPOINT_STEPS} --text-key transcription --dataset google/fleurs --lang el

For HParl and Logotypografia (assuming you have downloaded both datasets and converted them to Hugging Face format in `./hparl-test-hf` and `./logotypografia-test-hf`)

export CHECKPOINT_STEPS=3000  # latest checkpoint
cd training-scripts
python decode_whisper_hplg.py --processor ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf --model ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf/checkpoint-${CHECKPOINT_STEPS} --text-key transcription --dataset ./hparl-test-hf --lang el
python decode_whisper_hplg.py --processor ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf --model ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf/checkpoint-${CHECKPOINT_STEPS} --text-key transcription --dataset ./logotypografia-test-hf --lang el

For the Greek podcast dataset

export CHECKPOINT_STEPS=3000  # latest checkpoint
cd training-scripts
python decode_whisper_podcast.py --processor ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf --model ./whisper-${MODEL}-el-${TRAINING_SUBSET}-hf/checkpoint-${CHECKPOINT_STEPS} --text-key transcription --dataset ../greek_podcast_dataset/test --lang el
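
ASR evaluations of this kind are usually reported as word error rate (WER). The source does not show what the `decode_whisper_*.py` scripts print, so the sketch below only illustrates the metric itself: the word-level edit distance between reference and hypothesis, divided by the number of reference words. Text normalization (casing, punctuation, accents), which real evaluations typically apply first, is left out.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One of five reference words is missing from the hypothesis.
print(wer("καλημέρα σε όλους τους ακροατές", "καλημέρα σε όλους ακροατές"))
```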

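Summary

Dataset introduction

Construction

The Greek Podcast Corpus (GPC) was built by collecting Modern Greek podcasts and then transcribing and segmenting them with the WhisperX pipeline. The pipeline has four steps: voice activity detection, cut and merge, transcription, and time alignment. The collected audio is first converted to WAV and then processed with WhisperX, which yields speech segments of roughly 30 seconds together with their transcripts.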
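
To show how segments of roughly 30 seconds can arise from voice activity detection, here is a simplified sketch of the merge half of the cut-and-merge step: consecutive speech intervals are combined greedily as long as the resulting chunk stays within a maximum length. This is an illustration under stated assumptions and not WhisperX's actual implementation; the function `merge_vad_segments` is hypothetical, and cutting intervals that are longer than the limit on their own is not modelled.

```python
def merge_vad_segments(segments, max_len=30.0):
    """Greedily merge consecutive (start, end) speech intervals, given in seconds
    and sorted by time, into chunks that span at most `max_len` seconds."""
    chunks = []
    for start, end in segments:
        if chunks and end - chunks[-1][0] <= max_len:
            # Extending the current chunk keeps it within the limit.
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks


speech = [(0.0, 8.0), (9.0, 20.0), (21.0, 29.5), (31.0, 40.0), (41.0, 62.0)]
print(merge_vad_segments(speech))
```

In this example the first three intervals merge into one 29.5-second chunk, while the last two stay separate because combining them would exceed 30 seconds.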

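Usage

Using GPC involves three stages. First, the podcast audio is preprocessed with the WhisperX pipeline to obtain transcripts and speech segments. Second, the transcripts and segments are used to train, validate, and test Whisper models. Finally, the models are evaluated on standard datasets. GPC provides training subsets of several sizes to suit training runs of different scales.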

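Background and challenges

Background overview

The Greek Podcast Corpus (GPC) is a Modern Greek speech recognition dataset created by the Institute for Language and Speech Processing (ILSP) of the Athena Research Center. It was built to address the challenges that low-resource languages face in speech technology, in particular improving automatic speech recognition (ASR) performance. GPC uses the WhisperX pipeline to transcribe and segment podcasts, producing a multi-domain corpus of 800 hours of audio categorized into 16 domains. Beyond serving as a research resource, the dataset shows that a weakly supervised learning strategy can improve ASR performance for a low-resource language.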

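Current challenges

Building GPC involved several challenges. First, because digital resources for Modern Greek are limited, collecting a large amount of high-quality speech data is difficult. Second, during preprocessing, long-form audio has to go through voice activity detection, segmentation, and transcription in the WhisperX pipeline in a way that keeps the data accurate and usable. Third, because the corpus covers multiple domains, cross-domain evaluation and model generalization are significant concerns. Finally, a dataset split that supports reliable evaluation is essential for validating model performance.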

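Common scenarios

Typical use cases

GPC is used in Modern Greek speech recognition research, in particular for building and evaluating automatic speech recognition (ASR) systems. By segmenting and transcribing podcast audio with WhisperX, it gives researchers a large-scale, multi-domain Modern Greek audio resource for training and testing ASR models.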

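Academic problems addressed

GPC addresses the shortage of training data for ASR systems in low-resource languages. The podcast data is transcribed with the large pretrained Whisper model, which provides weak supervision and yields a large amount of speech data for Modern Greek. This lets researchers train well-performing ASR models under low-resource conditions, and cross-domain and cross-dataset evaluations support the effectiveness of the approach.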

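Practical applications

In practice, GPC can be used to improve ASR systems for Modern Greek, particularly in low-resource settings where data is scarce. It is a data resource for developing speech recognition technology and supports further development and deployment of such systems.

Recent research on the dataset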