Chinese Speech Recognition Using Conformer Fused with Max Pooling

Source: China Scientific Data (中国科学数据). Updated: 2026-01-19. Indexed: 2026-04-25.
Download link:
https://www.sciengine.com/AA/doi/10.19678/j.issn.1000-3428.0070055
Description:
Speech recognition technology uses signal processing and learning algorithms to let machines understand human speech, making communication between humans and machines more convenient. Most existing studies on end-to-end speech recognition focus on optimizing the Conformer model. However, the Conformer encoder does not extract enough fine-grained local speech features. To address this issue, this study proposes a Chinese speech recognition method based on Max Pooling (MP). First, the output of the gated linear unit in the convolutional module of the encoder is max-pooled along the time dimension to extract fine-grained local features that reflect the characteristics of multiple speech signal frames. Second, these pooled features are fused, by element-wise summation, with the coarse-grained local features extracted via Depthwise Convolution (DWC). The fusion increases the amount of local speech feature information and improves the recognition accuracy of the Conformer model. Experimental results on the public Chinese dataset Aishell-1 show that the improved model reduces the Character Error Rate (CER) of the baseline model from 5.58% to 5.32% with greedy search decoding and from 5.06% to 4.92% with attention rescoring.
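
The two steps described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The description does not give the pooling kernel size, stride, or padding, so the sketch assumes a kernel of 3, stride 1, and edge truncation so that the pooled output keeps the same number of frames as the depthwise convolution output. It also assumes, as in the standard Conformer convolution module, that the depthwise convolution takes the gated linear unit output as its input. All function names (`glu`, `max_pool_time`, `depthwise_conv_time`, `fuse`) are hypothetical.

```python
import math


def glu(x):
    """Gated linear unit: x is [time][2*C]; returns [time][C] as a * sigmoid(b)."""
    c = len(x[0]) // 2
    return [
        [row[i] / (1.0 + math.exp(-row[i + c])) for i in range(c)]
        for row in x
    ]


def max_pool_time(x, kernel=3):
    """Max pooling along time with stride 1 (assumed); windows are clipped at the edges."""
    t, c = len(x), len(x[0])
    half = kernel // 2
    out = []
    for i in range(t):
        lo, hi = max(0, i - half), min(t, i + half + 1)
        out.append([max(x[j][ch] for j in range(lo, hi)) for ch in range(c)])
    return out


def depthwise_conv_time(x, weights):
    """Depthwise 1-D convolution along time; weights[ch] is the kernel of channel ch.

    Zero padding keeps the output length equal to the input length.
    """
    t, c = len(x), len(x[0])
    k = len(weights[0])
    half = k // 2
    out = []
    for i in range(t):
        row = []
        for ch in range(c):
            acc = 0.0
            for m in range(k):
                j = i + m - half
                if 0 <= j < t:
                    acc += weights[ch][m] * x[j][ch]
            row.append(acc)
        out.append(row)
    return out


def fuse(x, weights, pool_kernel=3):
    """Element-wise sum of the max-pooled branch and the depthwise-conv branch."""
    gated = glu(x)
    pooled = max_pool_time(gated, pool_kernel)    # fine-grained local features
    conv = depthwise_conv_time(gated, weights)    # coarse-grained local features
    return [[p + q for p, q in zip(pr, cr)] for pr, cr in zip(pooled, conv)]
```

Because both branches preserve the frame count and channel count, the element-wise sum is well defined without any extra projection. In a real encoder the remaining layers of the convolution module (normalization, activation, pointwise convolution) would follow the fused output.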
Created:
2026-01-19