While FPGAs have been recognized as a promising platform for accelerating Convolutional Neural Networks (CNNs) in embedded computing, given their high flexibility and power efficiency, two challenges must still be addressed to enhance their applicability in the edge-computing paradigm. First, the power and performance of CNN accelerators are still bounded by memory throughput, and a CNN-customized architecture is desirable to fully utilize the on-chip storage. Second, power optimization algorithms remain insufficiently explored on CNN-targeted platforms. In this paper, we design a novel FPGA-based CNN accelerator architecture that makes full use of the on-chip storage resources by leveraging data reuse and loop unrolling strategies. We also present an efficient FPGA-based voltage and frequency scaling (VFS) system that enables VFS of the CNN accelerator for power optimization. We devise a VFS policy that fully exploits the power efficiency potential of the FPGA. Experimental results show that up to 40% of energy can be saved with our VFS platform and policy.