Facial actions are spatio-temporal signals by nature, and therefore their modeling crucially depends on the availability of temporal information. In this paper, we focus on inferring such temporal dynamics of facial actions when no explicit temporal information is available, i.e., from still images. We present a novel self-supervised learning approach to capture multiple scales of temporal dynamics, with application to facial Action Unit (AU) intensity estimation and dimensional affect estimation. In particular: (1) we propose a framework that infers a dynamic representation (DR) from a still image, capturing the bi-directional flow of time within a short time window centered at the input image; (2) we show that the proposed rank loss exploits the temporal evolution of facial expressions to self-supervise training, without requiring target representations, allowing the network to represent dynamics more broadly; (3) we propose a multiple-temporal-scale approach that infers DRs for different window lengths (MDR) from a still image. We empirically validate our approach on the task of frame ranking, and show that the proposed MDR attains state-of-the-art results on BP4D for AU intensity estimation and on SEMAINE for dimensional affect estimation, using only still images at test time.
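The self-supervised rank loss mentioned in point (2) can be illustrated as a pairwise margin ranking objective over per-frame scores, where minimizing the loss encourages the scores to respect the frames' temporal order. The function name, margin value, and formulation below are illustrative assumptions for exposition, not the authors' implementation.

```python
def pairwise_rank_loss(scores, margin=1.0):
    """Hinge-style ranking loss over a temporally ordered list of
    per-frame scores (e.g., scalar projections of dynamic
    representations). Each later frame is encouraged to score at
    least `margin` higher than each earlier frame, so temporal
    evolution alone supervises the ordering -- no target
    representations are needed. Illustrative sketch only.
    """
    loss = 0.0
    n_pairs = 0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            # frame i precedes frame j; penalize ordering violations
            loss += max(0.0, margin - (scores[j] - scores[i]))
            n_pairs += 1
    return loss / max(1, n_pairs)
```

For example, scores that already increase by at least the margin along the window (e.g., `[0.0, 1.0, 2.0]`) incur zero loss, while a reversed ordering is penalized.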