Diabetic retinopathy (DR), a complication of diabetes, is a common cause of progressive retinal damage. Mass screening of populations for DR is time-consuming, so computerized diagnosis is of great significance in clinical practice, providing evidence to assist clinicians in decision making. Specifically, hemorrhages, microaneurysms, hard exudates, soft exudates, and other lesions have been verified to be closely associated with DR. These lesions, however, are scattered across different positions and scales in fundus images, and their internal relations are difficult to preserve in the final features because the large number of convolution layers progressively discards detailed characteristics. In this paper, we present a deep-learning network with a multi-scale self-attention module that aggregates global context into the learned features for DR image retrieval. The multi-scale fusion enhances, across scales, the effective latent relations among different positions in the features explored by self-attention. In experiments, the proposed network is validated on the Kaggle DR dataset, and the results show that it achieves state-of-the-art performance.
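To make the idea concrete, the following is a minimal PyTorch sketch of a multi-scale self-attention module of the kind the abstract describes: a non-local-style self-attention block relates all spatial positions of a feature map, and attention is applied at several pooled scales before the results are fused. This is an illustrative assumption, not the paper's exact architecture; the class names (`SelfAttention2d`, `MultiScaleSelfAttention`), the choice of scales, and the fusion by 1x1 convolution are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention2d(nn.Module):
    """Non-local-style self-attention over a 2D feature map (illustrative sketch)."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        # 1x1 convolutions project the feature map into query/key/value spaces.
        self.query = nn.Conv2d(channels, channels // reduction, 1)
        self.key = nn.Conv2d(channels, channels // reduction, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        # Learnable residual weight, initialized to zero so training starts
        # from the identity mapping.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (b, hw, c')
        k = self.key(x).flatten(2)                    # (b, c', hw)
        v = self.value(x).flatten(2)                  # (b, c, hw)
        # Pairwise affinity between every pair of spatial positions.
        attn = F.softmax(q @ k, dim=-1)               # (b, hw, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x


class MultiScaleSelfAttention(nn.Module):
    """Apply self-attention at several pooled scales and fuse the results."""

    def __init__(self, channels, scales=(1, 2)):
        super().__init__()
        self.scales = scales
        self.attns = nn.ModuleList(SelfAttention2d(channels) for _ in scales)
        self.fuse = nn.Conv2d(channels * len(scales), channels, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = []
        for s, attn in zip(self.scales, self.attns):
            # Downsample, attend, then upsample back to the input resolution.
            y = F.avg_pool2d(x, s) if s > 1 else x
            y = attn(y)
            if s > 1:
                y = F.interpolate(y, size=(h, w), mode="bilinear",
                                  align_corners=False)
            feats.append(y)
        return self.fuse(torch.cat(feats, dim=1))


x = torch.randn(2, 32, 16, 16)   # toy batch of feature maps
out = MultiScaleSelfAttention(32)(x)
print(out.shape)  # torch.Size([2, 32, 16, 16])
```

Because attention is computed at a coarser pooled scale as well as the native one, positional relations between lesions of different sizes can be captured even after many convolutions have shrunk the spatial detail.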