Design of a CNN Accelerator Based on Systolic Array Collaboration with Inter-Layer Fusion
收藏中国科学数据2026-04-16 更新2026-04-25 收录
下载链接:
https://www.sciengine.com/AA/doi/10.11999/JEIT250867
下载链接
链接失效反馈官方服务:
资源简介:
ObjectiveWith the rapid deployment of deep learning in edge computing, the demand for efficient Convolutional Neural Network (CNN) accelerators continues to increase. Although traditional CPUs and GPUs provide strong computational capability, they incur high power consumption, long latency, and limited scalability in real-time embedded scenarios. FPGA-based accelerators, due to their reconfigurability and parallelism, provide a viable alternative. However, current designs often show low resource utilization, memory access bottlenecks, and difficulty in balancing throughput and energy efficiency. To address these issues, a systolic array–based CNN accelerator with inter-layer fusion optimization is proposed. The design integrates an enhanced memory hierarchy and optimized computation scheduling. Hardware-oriented convolution mapping and lightweight quantization are adopted to improve computational efficiency and reduce resource consumption, while meeting real-time inference requirements for applications such as intelligent surveillance and autonomous driving.MethodsThis study addresses core challenges in FPGA-based CNN accelerators, including data transfer overhead, insufficient resource utilization, and low processing unit efficiency. A hybrid accelerator architecture based on systolic array–assisted inter-layer fusion is proposed. Computation-intensive adjacent layers are tightly coupled and executed sequentially within a single systolic array, which reduces frequent off-chip memory accesses for intermediate results. This reduces data transfer overhead and power consumption and improves computation speed and overall energy efficiency. A dynamically reconfigurable systolic array is further developed to support multi-dimensional matrix multiplications with varying scales. This design avoids resource waste caused by fixed-function hardware and reduces FPGA logic consumption, thereby improving hardware adaptability and flexibility. A streaming systolic array computation scheme is also introduced through coordinated computation flow and control logic. Processing elements maintain a high-efficiency operating state, and data flows continuously through the computation engine in a pipelined and parallel manner. This improves processing unit utilization, reduces idle cycles, and increases overall throughput.Results and DiscussionsTo determine appropriate quantization precision, experiments are conducted on the MNIST dataset using VGG16 and ResNet50 under fixed-point quantization with 12-bit, 10-bit, 8-bit, and 6-bit precision. As shown in Table 1, inference accuracy decreases significantly when precision falls below 8 bits, indicating that excessively low precision weakens model representational capacity. On the proposed accelerator, VGG16, ResNet50, and YOLOv8n achieve peak computational performances of 390.25 GOPS, 360.27 GOPS, and 348.08 GOPS, respectively. Performance comparisons with FPGA accelerators reported in the literature are summarized in Table 4. Table 5 presents comparisons with CPU and GPU platforms in terms of throughput and energy efficiency. For VGG16, ResNet50, and YOLOv8n, the proposed accelerator delivers throughput that is 1.76×, 3.99×, and 2.61× higher than the corresponding CPU platforms. Energy efficiency improves by 3.1× (VGG16), 2.64× (ResNet50), and 2.96× (YOLOv8n) compared with GPU platforms, demonstrating superior energy utilization.ConclusionsA systolic array–assisted inter-layer fusion CNN accelerator architecture is proposed. A theoretical analysis of computational density confirms the performance advantages of the design. To address variation in convolution window sizes in the second layer, a dynamically reconfigurable systolic array method is developed. A streaming systolic array scheme is also implemented to sustain pipelined and parallel data flow within the computation engine. This design reduces idle cycles and improves throughput. Experimental results show that the accelerator achieves high computational performance with minimal loss in inference accuracy. Peak performances of 390.25 GOPS, 360.27 GOPS, and 348.08 GOPS are achieved for VGG16, ResNet50, and YOLOv8n, respectively. Compared with CPU and GPU platforms, the proposed accelerator shows superior energy efficiency and is suitable for resource-constrained and energy-sensitive edge computing scenarios.
创建时间:
2026-04-16



