Registration of multimodal retinal images is of great importance in facilitating the diagnosis and treatment of many eye diseases, such as the registration between color fundus images and optical coherence tomography (OCT) images. However, it is difficult to obtain ground truth, and most existing algorithms are for rigid registration without considering the optical distortion. In this paper, we present an unsupervised learning method for deformable registration between the two images. To solve the registration problem, the structure achieves a multi-level receptive field and takes contour and local detail into account. To measure the edge difference caused by different distortions in the optics center and edge, an edge similarity (ES) loss term is proposed, so loss function is composed by local cross-correlation, edge similarity and diffusion regularizer on the spatial gradients of the deformation matrix. Thus, we propose a multi-scale input layer, U-net with dilated convolution structure, squeeze excitation (SE) block and spatial transformer layers. Quantitative experiments prove the proposed framework is best compared with several conventional and deep learningbased methods, and our ES loss and structure combined with Unet and multi-scale layers achieve competitive results for normal and abnormal images.