Chinese Speech Recognition Using Conformer Fused with Max Pooling

中国科学数据 (China Scientific Data) | Updated: 2026-01-19 | Indexed: 2026-04-25

Download link:

https://www.sciengine.com/AA/doi/10.19678/j.issn.1000-3428.0070055

Resource description:

Speech recognition technology enables machines to understand human speech through algorithms and signal processing, making communication between humans and machines more convenient. Most existing studies on end-to-end speech recognition focus on optimizing the Conformer model. However, the Conformer encoder does not sufficiently extract fine-grained local speech features. To address this issue, this study proposes a Chinese speech recognition method that fuses Max Pooling (MP) into the Conformer. First, the output of the gated linear unit in the encoder's convolution module is max-pooled along the time dimension to extract fine-grained local features that capture the characteristics of multiple speech frames. Second, these pooled features are fused, by element-wise summation, with the coarse-grained local features extracted by Depthwise Convolution (DWC), which enriches the local speech feature information and improves the recognition accuracy of the Conformer model. Experimental results on the public Chinese dataset Aishell-1 show that the improved model reduces the Character Error Rate (CER) of the baseline model from 5.58% to 5.32% with greedy search decoding and from 5.06% to 4.92% with attention rescoring decoding.
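
The two steps described in the abstract (max pooling the gated linear unit output along time, then summing it element-wise with the depthwise convolution output) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the pooling window size, the stride of 1, the "same" padding that keeps the number of frames unchanged, and the choice to fuse directly after the depthwise convolution (before any normalization or activation) are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def glu(x):
    """Gated linear unit over channels: x is [T][2C], result is [T][C]."""
    half = len(x[0]) // 2
    return [
        [a / (1.0 + math.exp(-b)) for a, b in zip(frame[:half], frame[half:])]
        for frame in x
    ]


def max_pool_time(x, kernel_size):
    """Max pooling along the time axis (assumed: stride 1, odd window,
    window clipped at the edges so the output keeps T frames)."""
    num_frames, num_channels = len(x), len(x[0])
    pad = kernel_size // 2
    out = []
    for t in range(num_frames):
        lo, hi = max(0, t - pad), min(num_frames, t + pad + 1)
        out.append([max(x[s][c] for s in range(lo, hi))
                    for c in range(num_channels)])
    return out


def depthwise_conv_time(x, kernels):
    """Depthwise 1-D convolution along time: kernels[c] filters channel c
    only (assumed: odd kernel length, zero padding, T frames preserved)."""
    num_frames, num_channels = len(x), len(x[0])
    pad = len(kernels[0]) // 2
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            acc = 0.0
            for j, w in enumerate(kernels[c]):
                s = t + j - pad
                if 0 <= s < num_frames:
                    acc += w * x[s][c]
            out[t][c] = acc
    return out


def fuse_max_pool_with_dwc(glu_out, dwc_kernels, pool_size):
    """Element-wise sum of the max-pooled branch and the depthwise branch."""
    pooled = max_pool_time(glu_out, pool_size)
    conv = depthwise_conv_time(glu_out, dwc_kernels)
    return [[p + q for p, q in zip(pf, cf)] for pf, cf in zip(pooled, conv)]


if __name__ == "__main__":
    # Toy GLU output: 3 frames, 2 channels (hypothetical values).
    g = [[1.0, 4.0], [3.0, 2.0], [2.0, 5.0]]
    # Identity depthwise kernels, so the conv branch returns g unchanged.
    identity = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
    print(fuse_max_pool_with_dwc(g, identity, pool_size=3))
    # prints [[4.0, 8.0], [6.0, 7.0], [5.0, 10.0]]
```

Because both branches keep the same number of frames and channels, the element-wise sum needs no reshaping. In a real model the depthwise kernels would be learned parameters rather than the identity filters used in this toy example.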

Created:

2026-01-19
