Partial Overlap Benchmark (POB-Spark/POB-LP)

Name: Partial Overlap Benchmark (POB-Spark/POB-LP)
Creator: University of California, San Diego, Department of Electrical and Computer Engineering; Doctor Inc.
Published: 2026-02-10 01:32:03
License: not specified

Source: arXiv; updated 2026-02-10; indexed 2026-02-11

Download link:

https://github.com/cijinsama/Partial-Overlap-Benchmark

Overview:

The Partial Overlap Benchmark (POB) is a multimodal speech-text dataset built jointly by the University of California, San Diego and Doctor Inc. It has two subsets, POB-Spark and POB-LP. POB-LP extends LibriPhrase by appending high-frequency English words to create prefix-overlapping samples. POB-Spark is synthesized with Spark-TTS and has a balanced phrase-length distribution. The dataset contains 270,000 entries in total and targets prefix bias in open-vocabulary keyword spotting: by simulating partially matching voice commands that occur in real use, such as "turn the light on/off", it supports research on robust voice interaction systems for edge devices.

Created:

2026-02-10

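Summary of Original Information

Partial-Overlap-Benchmark (POB) Dataset Overview

Introduction

The Partial Overlap Benchmark (POB) is a dataset for evaluating open-vocabulary keyword spotting (OV-KWS) models. It targets cases where the query phrase and the enrolled phrase share an overlapping prefix but are not identical (for example, "turn the light on" vs. "turn the light off"). Existing benchmarks such as LibriPhrase and Google Speech Commands rarely test such cases, which lets prefix bias in models go unmeasured; POB is designed to address this gap.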

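Dataset Composition

POB consists of two complementary datasets:

- POB-LibriPhrase (POB-LP): derived from LibriPhrase by appending common English words to force prefix overlap.
- POB-Spark: a synthetic corpus generated with the Spark-TTS model, intended to provide controlled overlap patterns across different speaker characteristics.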
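
The POB-LP idea of appending a common word to force prefix overlap can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `make_prefix_overlap_pairs` and the word list `COMMON_WORDS` are hypothetical, and the real benchmark chooses the appended words by its own rules. The pairing convention (shorter phrase as query, extended phrase as anchor, `match_label` 0) follows the sample record given in this document.

```python
from typing import Dict, List

# Hypothetical stand-in for a list of high-frequency English words.
COMMON_WORDS: List[str] = ["on", "off", "now", "again"]


def make_prefix_overlap_pairs(phrase: str, words: List[str]) -> List[Dict[str, object]]:
    """Build non-matching pairs in which the query is a strict prefix of the anchor."""
    pairs: List[Dict[str, object]] = []
    for word in words:
        pairs.append(
            {
                "query_transcript": phrase,
                "anchor_transcript": f"{phrase} {word}",
                # The two transcripts differ, so the pair is a negative example.
                "match_label": 0,
            }
        )
    return pairs


pairs = make_prefix_overlap_pairs("turn the light", COMMON_WORDS[:2])
for pair in pairs:
    print(pair["query_transcript"], "|", pair["anchor_transcript"], "|", pair["match_label"])
```

A model with prefix bias would tend to accept these pairs because the shared prefix dominates the comparison, which is the failure mode the benchmark is meant to expose.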

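Data Examples and Definitions

POB Case Illustration

Enrolled phrase	Overlapping query	Result
turn the light on	turn the light off	Prefix-confusable
service	survive	Non-prefix confusable
play the music	stop the music	Non-prefix confusable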
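
The distinction drawn in the table can be approximated with a simple word-level rule: a pair is prefix-confusable when the two phrases differ but start with at least one identical word. The sketch below assumes that rule; the helper names are hypothetical, and the benchmark itself may define overlap at a finer level (for example, over phoneme or grapheme tokens).

```python
def shared_prefix_words(a: str, b: str) -> int:
    """Count the leading words that two phrases have in common."""
    count = 0
    for word_a, word_b in zip(a.split(), b.split()):
        if word_a != word_b:
            break
        count += 1
    return count


def classify_pair(enrolled: str, query: str) -> str:
    """Label a pair using the word-level rule assumed above."""
    if enrolled == query:
        return "match"
    if shared_prefix_words(enrolled, query) > 0:
        return "prefix-confusable"
    return "non-prefix confusable"


for enrolled, query in [
    ("turn the light on", "turn the light off"),
    ("service", "survive"),
    ("play the music", "stop the music"),
]:
    print(f"{enrolled!r} vs {query!r}: {classify_pair(enrolled, query)}")
```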

Data Loading Example

After loading the data from Hugging Face, a single sample has the following structure (Python):

    {
        "query_audio": {"path": "5405-121045-0041_0.wav", "array": ..., "sampling_rate": 16000},
        "query_audio_textgrid": "YOUR_PATH/LibriPhrase/.../5405-121045-0041_0.TextGrid",
        "query_text": [42, 5, 54, 35, 41],
        "query_transcript": "carriage",
        "query_text_len": 5,
        "anchor_audio": {"path": "1006-135212-0068_0.wav", "array": ..., "sampling_rate": 16000},
        "anchor_audio_textgrid": "YOUR_PATH/LibriPhrase/.../1006-135212-0068_0.TextGrid",
        "anchor_text": [42, 5, 54, 35, 41, 42, 14, 45, 57, 35, 46],
        "anchor_transcript": "carriage counting",
        "anchor_text_len": 11,
        "match_label": 0
    }
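
A record like the one above can be sanity-checked before it is used for evaluation. The sketch below copies only the text fields from that sample and assumes that `query_text` and `anchor_text` are sequences of token IDs whose lengths are stored in the `*_len` fields; the helper names `check_lengths` and `query_is_prefix_of_anchor` are hypothetical and not part of the benchmark code.

```python
from typing import Any, Dict, List

# Text fields copied from the sample record; audio fields are omitted.
sample: Dict[str, Any] = {
    "query_text": [42, 5, 54, 35, 41],
    "query_text_len": 5,
    "anchor_text": [42, 5, 54, 35, 41, 42, 14, 45, 57, 35, 46],
    "anchor_text_len": 11,
    "match_label": 0,
}


def check_lengths(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the stored length fields."""
    problems: List[str] = []
    for name in ("query_text", "anchor_text"):
        if len(record[name]) != record[f"{name}_len"]:
            problems.append(f"{name}_len does not match len({name})")
    return problems


def query_is_prefix_of_anchor(record: Dict[str, Any]) -> bool:
    """True when the query tokens are a strict prefix of the anchor tokens."""
    query, anchor = record["query_text"], record["anchor_text"]
    return len(query) < len(anchor) and anchor[: len(query)] == query


problems = check_lengths(sample)
is_prefix = query_is_prefix_of_anchor(sample)
print(problems, is_prefix)
```

For this sample the query tokens are a strict prefix of the anchor tokens while `match_label` is 0, which is the partial-overlap negative case the benchmark focuses on.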

Obtaining and Generating the Data

Direct Loading

The processed dataset can be loaded directly from the Hugging Face repository RiceHunger/Partial-Overlap-Benchmark.

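Generating from Scratch

Environment Setup

Requires Python 3.12.0 with the dependencies listed in requirements.txt installed.

Data Preparation

Prepare LibriPhrase: follow its official instructions to prepare the data, producing a directory structure that contains subdirectories such as train_500, train_360, and train_100.
Generate the POB-Spark dataset:
- Run the script to generate the metadata: python prepare_meta.py --num_perposition 100 --num_pairs 50000 --max_len 25 --output meta_text.csv
- Synthesize the audio with Spark-TTS: cd Spark-TTS; python spark_generate.py ../meta_text.csv YOUR_PRETRAINED_MODEL_DIR/Spark-TTS-0.5B YOUR_SAVE_DIR/valid
- Run forced alignment with MFA: mfa align YOUR_SAVE_DIR/valid english_us_arpa english_us_arpa YOUR_SAVE_DIR/valid --single_speaker
- The final directory should contain meta.csv together with the .txt, .wav, and .TextGrid files.
Load the data: update the file paths in configs/data/*.yaml and dataloader.py, then run python example.py to load the data (POB-LP is generated at runtime).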
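
Before pointing the configuration files at the generated data, it can help to confirm that synthesis and alignment both completed. The sketch below is one possible check, not part of the repository: it assumes a flat output directory in which every .wav file has a .txt and a .TextGrid file with the same stem, next to a single meta.csv. The function names are hypothetical.

```python
from pathlib import Path
from typing import List

# Companion files assumed to accompany every .wav file.
REQUIRED_SUFFIXES = (".txt", ".TextGrid")


def find_incomplete_utterances(data_dir: Path) -> List[str]:
    """Return the stems of .wav files that lack a transcript or an alignment."""
    incomplete: List[str] = []
    for wav in sorted(data_dir.glob("*.wav")):
        for suffix in REQUIRED_SUFFIXES:
            if not wav.with_suffix(suffix).exists():
                incomplete.append(wav.stem)
                break
    return incomplete


def check_layout(data_dir: Path) -> bool:
    """True when meta.csv exists and every utterance has all companion files."""
    has_meta = (data_dir / "meta.csv").exists()
    return has_meta and not find_incomplete_utterances(data_dir)
```

Calling `check_layout(Path("YOUR_SAVE_DIR/valid"))` would then report whether the directory matches the assumed layout, and `find_incomplete_utterances` lists the utterances that need to be regenerated or realigned.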

References and Citation

Citation Format

@inproceedings{liu2026pob,
  author={Yi Liu and Chuan-Che Huang and Xiao Quan},
  title={No Word Left Behind: Mitigating Prefix Bias in Open-Vocabulary Keyword Spotting},
  booktitle={ICASSP 2026},
  year={2026}
}
