TY - GEN
T1 - An FPGA-based Multi-Core Overlay Processor for Transformer-based Models
AU - Lu, Shaoqiang
AU - Zhao, Tiandong
AU - Zhang, Rumin
AU - Lin, Ting Jung
AU - Wu, Chen
AU - He, Lei
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Transformer-based models have achieved extensive success with increasingly large numbers of parameters and computations, for which many multi-core accelerators have been developed. Nevertheless, they suffer from limited throughput due to either low operating frequency or high communication overhead between cores. This paper proposes an FPGA-based multi-core overlay processor, MCore-OPU, to optimize intra-core computation and inter-core communication. First, we boost the operating frequency of the processing element (PE) array to double that of the rest of the processor to improve intra-core throughput. Second, we develop on-chip synchronization routers to reduce expensive off-chip memory traffic: only the partial sum and maximum are communicated between cores, rather than entire vectors, for layer normalization and softmax. Meanwhile, we optimize the multi-core model allocation and scheduling to minimize inter-core communication and maximize intra-core computation efficiency. MCore-OPU is implemented with four cores and four DDRs on the Xilinx U200 FPGA, where the PE array runs at 600 MHz and the rest runs at 300 MHz. Experimental results show that MCore-OPU outperforms other FPGA-based accelerators by 1.24x-1.39x and the A100 GPU by 5.31x-5.81x in throughput per DSP for BERT, ViT, GPT-2, and LLaMA inference.
AB - Transformer-based models have achieved extensive success with increasingly large numbers of parameters and computations, for which many multi-core accelerators have been developed. Nevertheless, they suffer from limited throughput due to either low operating frequency or high communication overhead between cores. This paper proposes an FPGA-based multi-core overlay processor, MCore-OPU, to optimize intra-core computation and inter-core communication. First, we boost the operating frequency of the processing element (PE) array to double that of the rest of the processor to improve intra-core throughput. Second, we develop on-chip synchronization routers to reduce expensive off-chip memory traffic: only the partial sum and maximum are communicated between cores, rather than entire vectors, for layer normalization and softmax. Meanwhile, we optimize the multi-core model allocation and scheduling to minimize inter-core communication and maximize intra-core computation efficiency. MCore-OPU is implemented with four cores and four DDRs on the Xilinx U200 FPGA, where the PE array runs at 600 MHz and the rest runs at 300 MHz. Experimental results show that MCore-OPU outperforms other FPGA-based accelerators by 1.24x-1.39x and the A100 GPU by 5.31x-5.81x in throughput per DSP for BERT, ViT, GPT-2, and LLaMA inference.
KW - FPGA Overlay Processor
KW - Multi-Core
KW - Synchronization Router
KW - Transformer
UR - https://www.scopus.com/pages/publications/85201730646
U2 - 10.1109/ISEDA62518.2024.10617729
DO - 10.1109/ISEDA62518.2024.10617729
M3 - Conference contribution
AN - SCOPUS:85201730646
T3 - 2024 International Symposium of Electronics Design Automation, ISEDA 2024
SP - 697
EP - 702
BT - 2024 International Symposium of Electronics Design Automation, ISEDA 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 International Symposium of Electronics Design Automation, ISEDA 2024
Y2 - 10 May 2024 through 13 May 2024
ER -