TY - GEN
T1 - MambaOPU
T2 - 62nd ACM/IEEE Design Automation Conference, DAC 2025
AU - Lu, Shaoqiang
AU - Yu, Xuliang
AU - Zhao, Tiandong
AU - Miao, Siyuan
AU - Sheng, Xinsong
AU - Wu, Chen
AU - Zhao, Liang
AU - Lin, Ting Jung
AU - He, Lei
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - State-space models (SSMs), such as Mamba, have emerged as a promising alternative to Transformers. However, the recently developed Mamba2, based on state space duality (SSD), is highly memory-bound and suffers from limited computation efficiency. This inefficiency arises from its irregular broadcast element-wise multiplications and structured sparse computations. In this work, we propose MambaOPU, an FPGA overlay processor, to accelerate SSD. First, to reduce memory overhead, we introduce a software-hardware co-optimized operator fusion framework. Specifically, operator merging combines adjacent broadcast multiplication and summation operations into a single descriptor, while operator backward shifting embeds segment multiplication into subsequent operations. Both techniques shorten the computation path and improve computation efficiency. Second, to enhance sparse computation efficiency, we skip zero-region computations using a tensor-reorder-and-group algorithm combined with a sparse-predefined data fetcher. Additionally, since Mamba integrates linear operations with SSD, we develop a reconfigurable systolic array to improve data reuse across different computation modes. Extensive experimental results demonstrate that MambaOPU achieves up to 1812× and 880.79× higher normalized throughput and up to 12908× and 24.27× higher energy efficiency over Intel Xeon Gold 6348 CPU and NVIDIA A100 GPU, respectively.
AB - State-space models (SSMs), such as Mamba, have emerged as a promising alternative to Transformers. However, the recently developed Mamba2, based on state space duality (SSD), is highly memory-bound and suffers from limited computation efficiency. This inefficiency arises from its irregular broadcast element-wise multiplications and structured sparse computations. In this work, we propose MambaOPU, an FPGA overlay processor, to accelerate SSD. First, to reduce memory overhead, we introduce a software-hardware co-optimized operator fusion framework. Specifically, operator merging combines adjacent broadcast multiplication and summation operations into a single descriptor, while operator backward shifting embeds segment multiplication into subsequent operations. Both techniques shorten the computation path and improve computation efficiency. Second, to enhance sparse computation efficiency, we skip zero-region computations using a tensor-reorder-and-group algorithm combined with a sparse-predefined data fetcher. Additionally, since Mamba integrates linear operations with SSD, we develop a reconfigurable systolic array to improve data reuse across different computation modes. Extensive experimental results demonstrate that MambaOPU achieves up to 1812× and 880.79× higher normalized throughput and up to 12908× and 24.27× higher energy efficiency over Intel Xeon Gold 6348 CPU and NVIDIA A100 GPU, respectively.
UR - https://www.scopus.com/pages/publications/105017772529
U2 - 10.1109/DAC63849.2025.11132895
DO - 10.1109/DAC63849.2025.11132895
M3 - Conference contribution
AN - SCOPUS:105017772529
T3 - Proceedings - Design Automation Conference
BT - 2025 62nd ACM/IEEE Design Automation Conference, DAC 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 22 June 2025 through 25 June 2025
ER -