MCoreOPU: An FPGA-based Multi-Core Overlay Processor for Transformer-based Models

Shaoqiang Lu, Tiandong Zhao, Ting Jung Lin, Rumin Zhang, Chen Wu, Lei He

Research output: Journal Publication › Article › peer-review

Abstract

Transformer-based models have achieved extensive success with increasingly large numbers of parameters and computations, for which many multi-core accelerators have been developed. Nevertheless, these accelerators suffer from limited throughput due to either low operating frequency or high communication overhead between cores. This article proposes an FPGA-based multi-core overlay processor, named MCoreOPU, to optimize intra-core computation and inter-core communication. First, we boost the operating frequency of the processing element (PE) array to twice that of the rest of the processor to improve the intra-core throughput. Second, we develop on-chip synchronization routers to reduce off-chip memory traffic: for layer normalization and softmax, only the partial sum and maximum are communicated between cores rather than entire vectors. Moreover, we pipeline synchronization to reduce synchronization latency and develop a bypass of the interconnect bus to reduce the off-chip memory access latency. Finally, we optimize the multi-core model allocation and scheduling to minimize the inter-core communications and maximize the intra-core computation efficiency. The MCoreOPU is implemented in 8-bit fixed-point precision with four cores and four DDRs on the Xilinx U200 FPGA, where the PE array runs at 600 MHz while the rest runs at 300 MHz. Experimental results show that the throughput per MAC of MCoreOPU for BERT, ViT, GPT-2, and LLaMA inference is 1.31×–7.18× higher than that of other FPGA-based accelerators. Compared with the A100 GPU, the throughput per equivalent MAC is improved by 22.52×–27.12×.
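
The scalar-only synchronization described in the abstract can be illustrated with a small software sketch. This is not the authors' hardware design or code; it only assumes that each core holds a contiguous slice of the vector, and it uses floating point rather than the 8-bit fixed-point precision of MCoreOPU. The function names (`split`, `multicore_softmax`, `multicore_layernorm`) are hypothetical, and using the partial sums of x and x² for layer normalization is an assumption about what "partial sum" covers.

```python
import math

def split(vec, n_cores):
    """Partition a non-empty vector into contiguous slices, one per core (assumed layout)."""
    step = math.ceil(len(vec) / n_cores)
    return [vec[i:i + step] for i in range(0, len(vec), step)]

def multicore_softmax(vec, n_cores=4):
    shards = split(vec, n_cores)
    # Each core reduces its own slice to one scalar (local maximum);
    # only these scalars cross the core boundary.
    global_max = max(max(s) for s in shards)
    # Each core then reduces its slice to one partial sum of exponentials.
    partial_sums = [sum(math.exp(x - global_max) for x in s) for s in shards]
    total = sum(partial_sums)
    # Normalization is purely local: every core already owns its slice.
    return [math.exp(x - global_max) / total for s in shards for x in s]

def multicore_layernorm(vec, n_cores=4, eps=1e-5):
    shards = split(vec, n_cores)
    n = len(vec)
    # Assumption: each core shares two scalars, sum(x) and sum(x^2).
    sum_x = sum(sum(s) for s in shards)
    sum_x2 = sum(sum(x * x for x in s) for s in shards)
    mean = sum_x / n
    var = sum_x2 / n - mean * mean
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * inv_std for s in shards for x in s]
```

In this sketch each core exchanges a handful of scalars per vector instead of the vector itself, which is the source of the reduced inter-core traffic that the abstract attributes to the synchronization routers.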

Original language: English
Article number: 37
Journal: ACM Transactions on Reconfigurable Technology and Systems
Volume: 18
Issue number: 3
Publication status: Published - 19 Aug 2025
Externally published: Yes

Keywords

  • FPGA overlay processors
  • hardware accelerators
  • multi-core architectures
  • synchronization routers

ASJC Scopus subject areas

  • General Computer Science
