TY - JOUR
T1 - MCoreOPU
T2 - An FPGA-based Multi-Core Overlay Processor for Transformer-based Models
AU - Lu, Shaoqiang
AU - Zhao, Tiandong
AU - Lin, Ting Jung
AU - Zhang, Rumin
AU - Wu, Chen
AU - He, Lei
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2025/8/19
Y1 - 2025/8/19
N2 - Transformer-based models have achieved extensive success with increasingly large numbers of parameters and computations, for which many multi-core accelerators have been developed. Nevertheless, they suffer from limited throughput due to either low operating frequency or high communication overhead between cores. This article proposes an FPGA-based multi-core overlay processor, named MCoreOPU, to optimize intra-core computation and inter-core communication. First, we boost the operating frequency of the processing element (PE) array to double that of the rest of the processor, improving intra-core throughput. Second, we develop on-chip synchronization routers to reduce off-chip memory traffic, where only the partial sum and maximum, rather than entire vectors, are communicated between cores for layer normalization and softmax. Moreover, we pipeline synchronization to reduce synchronization latency and develop a bypass of the interconnect bus to reduce off-chip memory access latency. Finally, we optimize multi-core model allocation and scheduling to minimize inter-core communication and maximize intra-core computation efficiency. MCoreOPU is implemented in 8-bit fixed-point precision with four cores and four DDRs on the Xilinx U200 FPGA, where the PE array runs at 600 MHz while the rest runs at 300 MHz. Experimental results show that the throughput per MAC of MCoreOPU for BERT, ViT, GPT-2, and LLaMA inference is 1.31×–7.18× higher than that of other FPGA-based accelerators. Compared with the A100 GPU, the throughput per equivalent MAC is improved by 22.52×–27.12×.
AB - Transformer-based models have achieved extensive success with increasingly large numbers of parameters and computations, for which many multi-core accelerators have been developed. Nevertheless, they suffer from limited throughput due to either low operating frequency or high communication overhead between cores. This article proposes an FPGA-based multi-core overlay processor, named MCoreOPU, to optimize intra-core computation and inter-core communication. First, we boost the operating frequency of the processing element (PE) array to double that of the rest of the processor, improving intra-core throughput. Second, we develop on-chip synchronization routers to reduce off-chip memory traffic, where only the partial sum and maximum, rather than entire vectors, are communicated between cores for layer normalization and softmax. Moreover, we pipeline synchronization to reduce synchronization latency and develop a bypass of the interconnect bus to reduce off-chip memory access latency. Finally, we optimize multi-core model allocation and scheduling to minimize inter-core communication and maximize intra-core computation efficiency. MCoreOPU is implemented in 8-bit fixed-point precision with four cores and four DDRs on the Xilinx U200 FPGA, where the PE array runs at 600 MHz while the rest runs at 300 MHz. Experimental results show that the throughput per MAC of MCoreOPU for BERT, ViT, GPT-2, and LLaMA inference is 1.31×–7.18× higher than that of other FPGA-based accelerators. Compared with the A100 GPU, the throughput per equivalent MAC is improved by 22.52×–27.12×.
KW - FPGA overlay processors
KW - hardware accelerators
KW - multi-core architectures
KW - synchronization routers
UR - https://www.scopus.com/pages/publications/105018468557
U2 - 10.1145/3742437
DO - 10.1145/3742437
M3 - Article
AN - SCOPUS:105018468557
SN - 1936-7406
VL - 18
JO - ACM Transactions on Reconfigurable Technology and Systems
JF - ACM Transactions on Reconfigurable Technology and Systems
IS - 3
M1 - 37
ER -