Dynamical variational autoencoders (DVAEs) are a generalization of variational autoencoders (VAEs) for the case where the data is correlated through a hidden dynamical system. DVAEs use neural networks to learn a representation of a latent (a.k.a. hidden) state that evolves as a function of inputs and time.

In this post I go over the derivation of the loss function used to train a DVAE. I’ll start by deriving the loss for a regular VAE, introduce state-space models as a way of describing dynamical systems, and then derive the loss for a DVAE.

## Variational Autoencoders (VAEs)

Given a set of observations $\mathbf{x} = \{x^{(i)}\}_{i=1}^{N}$, we make two assumptions:

- Each observation $x^{(i)}$ is independently sampled from an unknown distribution $p(x)$.
- Each observation has a probabilistic dependence on an unobserved variable (a.k.a. latent or hidden variable) $z^{(i)}$ that contains information about the underlying structure of $x^{(i)}$. Therefore, we can also write the conditional distribution $p(x \mid z)$.

We wish to learn an approximation $p_\theta(x)$ of the underlying process that generated $\mathbf{x}$, where $\theta$ denotes the parameters of the approximation.

One way of learning $\theta$ is to maximize the log-likelihood of the data:

$$\theta^* = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta\left(x^{(i)}\right). \tag{1}$$

Unfortunately, directly maximizing Equation 1 is typically intractable.
To see why, consider the likelihood of a single datum $x^{(i)}$:

$$p_\theta\left(x^{(i)}\right) = \int p_\theta\left(x^{(i)} \mid z\right) p(z) \, dz. \tag{2}$$

Computing this integral requires marginalizing over all possible values of the latent variable $z$, which has no closed form for all but the simplest models and is prohibitively expensive to approximate numerically when $z$ is high-dimensional.
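To see the marginalization in Equation 2 concretely, here is a toy sketch where the latent is *discrete* with only two values, so the integral collapses to a two-term sum. All the numbers here (the fair-coin prior, the unit-variance Gaussian likelihoods) are hypothetical illustration choices; the point is that a continuous, high-dimensional $z$ admits no such shortcut.

```python
import math

# Hypothetical toy model with a discrete latent z in {0, 1}:
# p(z) is a fair coin, and p(x | z) is a unit-variance Gaussian
# whose mean depends on z.
prior = {0: 0.5, 1: 0.5}
means = {0: -1.0, 1: +1.0}

def gauss_pdf(x, mu):
    """Density of N(mu, 1) evaluated at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def marginal_likelihood(x):
    """p(x) = sum_z p(x | z) p(z) -- the discrete analogue of Equation 2."""
    return sum(gauss_pdf(x, means[z]) * p_z for z, p_z in prior.items())
```

With two latent values the sum is trivial; with a continuous latent vector the analogous computation is an intractable integral, which is exactly why we need the ELBO below.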

## The Evidence Lower Bound (ELBO) Loss

Fortunately, we can get around this computation by instead maximizing a lower bound on the data likelihood.
We begin by noting that if $q_\phi(z \mid x)$ is any distribution over $z$ (with parameters $\phi$), then Jensen's inequality gives

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right), \tag{3}$$

where the second term in Equation 3 is the Kullback-Leibler (KL) divergence between the approximated posterior and the prior on $z$. The right-hand side of Equation 3 is the evidence lower bound (ELBO).

In the literature on VAEs, the posterior distribution $q_\phi(z \mid x)$ is called the encoder and the likelihood $p_\theta(x \mid z)$ is called the decoder; both are typically implemented as neural networks with parameters $\phi$ and $\theta$, respectively.

To understand the explicit relationship between the ELBO and the log-likelihood, we can decompose the evidence as

$$\log p_\theta(x) = \mathrm{ELBO}(\theta, \phi; x) + D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right). \tag{4}$$

By re-arranging Equation 4, and noting that the second term is non-negative and only zero when the approximated posterior equals the true posterior, we can see that maximizing the ELBO implies maximizing log-likelihood and minimizing the KL divergence between the approximated and true posteriors.

Therefore, the ELBO is an appropriate surrogate objective for the intractable log-likelihood in Equation 1.
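To make the ELBO of Equation 3 concrete, here is a minimal numerical sketch for a one-dimensional model with a standard-normal prior and a Gaussian encoder. The KL term is analytic, and the reconstruction term is estimated by Monte Carlo with the reparameterization trick. The toy decoder and all constants are hypothetical choices for illustration, not part of the original derivation.

```python
import math
import random

def kl_gaussian_standard_normal(mu, log_var):
    """Analytic KL( N(mu, sigma^2) || N(0, 1) ) -- the second term of the ELBO
    for a Gaussian encoder and a standard-normal prior."""
    return 0.5 * (math.exp(log_var) + mu ** 2 - 1.0 - log_var)

def elbo_estimate(x, mu, log_var, decoder_log_lik, n_samples=1000, seed=0):
    """Monte Carlo estimate of the ELBO in Equation 3:
        E_q[log p(x|z)] - KL(q(z|x) || p(z)),
    sampling z via the reparameterization z = mu + sigma * eps."""
    rng = random.Random(seed)
    sigma = math.exp(0.5 * log_var)
    recon = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps              # reparameterized sample from q(z|x)
        recon += decoder_log_lik(x, z)
    recon /= n_samples
    return recon - kl_gaussian_standard_normal(mu, log_var)

def toy_decoder_log_lik(x, z):
    """Hypothetical decoder p(x|z) = N(x; z, 1), so
    log p(x|z) = -0.5*(x - z)^2 - 0.5*log(2*pi)."""
    return -0.5 * (x - z) ** 2 - 0.5 * math.log(2.0 * math.pi)
```

A nice sanity check on Equation 4: for this toy model the true posterior at $x=0$ is $\mathcal{N}(0, 1/2)$, and plugging it in as the encoder makes the KL gap vanish, so the ELBO estimate matches $\log p(x)$ up to Monte Carlo noise.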

## State-Space Models (SSMs)

So far we’ve talked about how to represent an input in a lower-dimensional space. If we have a set of images, like in the MNIST case, we can learn to represent features in the digits in a lower dimensional space. This makes sense: they’re all handwritten digits. But what if we know that the observations result from a dynamical system, such as a pendulum, some kind of flow field, or even a videogame? In this case, it might be useful to bake the observations’ dynamic nature into the latent space we’re trying to learn. Therefore, we need to modify the vanilla VAE, since it’s designed for uncorrelated data.

To get started, it is useful to review state-space models (SSMs), a mathematical model of physical dynamical systems used in time-series analysis, control theory, signal processing, neuroscience, and many other fields. We focus on the discrete-time, continuous-valued SSMs defined by the following equations:

$$z_t = f(z_{t-1}, u_t) + v_t, \tag{5}$$

$$v_t \sim p_{\theta^z}(v_t), \tag{6}$$

$$x_t = g(z_t) + w_t, \tag{7}$$

$$w_t \sim p_{\theta^x}(w_t), \tag{8}$$

where $z_t$ is the latent state, $u_t$ is a control input, $x_t$ is the observation, and $v_t$ and $w_t$ are the process and observation noise, respectively. Equations 5 and 6 define the state transition dynamics, and Equations 7 and 8 define the observation model.
The distributions in Equations 6 and 8 are parameterized by $\theta^z$ and $\theta^x$, respectively; we write $\theta = \{\theta^z, \theta^x\}$ for the full set of parameters.

Regarding notation: subscripts index time, and superscripts denote the corresponding variable.
In the following derivation we assume access to the observation sequence $x_{1:T}$ and the control sequence $u_{1:T}$. It will also be convenient to express the SSM in its equivalent probabilistic form:

$$z_t \sim p_{\theta^z}(z_t \mid z_{t-1}, u_t), \qquad x_t \sim p_{\theta^x}(x_t \mid z_t). \tag{9}$$
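To make the SSM structure concrete, here is a minimal simulation of a scalar linear-Gaussian SSM: the transition $f$ and observation map $g$ are linear, and the noise terms are Gaussian. The coefficients, noise scales, and sinusoidal control input are hypothetical choices for illustration, not something prescribed by the model class.

```python
import math
import random

def simulate_ssm(T, a=0.9, b=0.5, c=1.0, q_std=0.1, r_std=0.2, seed=0):
    """Sample a trajectory from a scalar linear-Gaussian SSM:
        z_t = a * z_{t-1} + b * u_t + v_t,  v_t ~ N(0, q_std^2)   (state, cf. Eqs. 5-6)
        x_t = c * z_t + w_t,                w_t ~ N(0, r_std^2)   (observation, cf. Eqs. 7-8)
    The controls u_t are a fixed sinusoid here, purely for illustration."""
    rng = random.Random(seed)
    z = 0.0
    states, observations, controls = [], [], []
    for t in range(T):
        u = math.sin(0.1 * t)                        # known control input u_t
        z = a * z + b * u + rng.gauss(0.0, q_std)    # state transition
        x = c * z + rng.gauss(0.0, r_std)            # noisy observation of the state
        controls.append(u)
        states.append(z)
        observations.append(x)
    return states, observations, controls
```

In practice only `observations` (and `controls`) would be available to the learner; the whole point of the DVAE is to recover a useful latent state without ever seeing `states`.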

## The ELBO for Dynamical VAEs

Now let’s approximate the distribution that generated the entire data sequence, not just individual data points.
This is because the data was generated from a dynamical system and is correlated over time: e.g., if the data $x_{1:T}$ is a sequence of video frames of a swinging pendulum, each frame is strongly correlated with the frames that precede it.

As one would expect, we run into the same issue as before: maximizing the likelihood $p_\theta(x_{1:T} \mid u_{1:T})$ directly is intractable, because it requires marginalizing over the entire latent sequence $z_{1:T}$.

Fortunately, we can get around this with the same approach as before: learn a conditioned likelihood $p_\theta(x_{1:T} \mid z_{1:T}, u_{1:T})$ together with an approximate posterior $q_\phi(z_{1:T} \mid x_{1:T}, u_{1:T})$, and maximize a sequence-level ELBO.

We derive this new ELBO by first noting that the dynamical system defined in Equation 9 is causal and Markovian.
It is a causal system because the distributions for observations and latent variables only depend on their values at previous time steps.
And it is a Markovian system because the transition dynamics only depend on the previous state.
Therefore,

$$p_\theta(x_t, z_t \mid x_{1:t-1}, z_{1:t-1}, u_{1:t}) = p_{\theta^x}(x_t \mid z_t) \, p_{\theta^z}(z_t \mid z_{t-1}, u_t). \tag{10}$$

We can also factorize the joint distribution of the observed and latent sequences as a product of conditional distributions for every timestep, i.e.,

$$p_\theta(x_{1:T}, z_{1:T} \mid u_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t, z_t \mid x_{1:t-1}, z_{1:t-1}, u_{1:t}). \tag{11}$$

Next, we write the ELBO in terms of the observation, latent, and control sequences, and then substitute the factorizations in Equations 10 and 11:

$$\mathcal{L}(\theta, \phi; x_{1:T}, u_{1:T}) = \mathbb{E}_{q_\phi(z_{1:T} \mid x_{1:T}, u_{1:T})}\left[\log p_\theta(x_{1:T}, z_{1:T} \mid u_{1:T}) - \log q_\phi(z_{1:T} \mid x_{1:T}, u_{1:T})\right]. \tag{12}$$

Now, if we let the approximate posterior inherit the causal, Markovian structure of the generative model, i.e., $q_\phi(z_{1:T} \mid x_{1:T}, u_{1:T}) = \prod_{t=1}^{T} q_\phi(z_t \mid z_{t-1}, x_{1:t}, u_{1:t})$, the ELBO decomposes into a per-timestep sum of a reconstruction term and a KL term:

$$\mathcal{L}(\theta, \phi; x_{1:T}, u_{1:T}) = \sum_{t=1}^{T} \mathbb{E}_{q_\phi(z_{1:t})}\left[\log p_{\theta^x}(x_t \mid z_t)\right] - \mathbb{E}_{q_\phi(z_{1:t-1})}\left[D_{\mathrm{KL}}\left(q_\phi(z_t \mid z_{t-1}, x_{1:t}, u_{1:t}) \,\|\, p_{\theta^z}(z_t \mid z_{t-1}, u_t)\right)\right], \tag{13}$$

where $q_\phi(z_{1:t})$ is shorthand for the marginal of the approximate posterior over the first $t$ latent variables. This is the loss used to train a DVAE.
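The per-timestep structure of this sequence ELBO can be sketched in code. The sketch below assumes, purely for illustration, that both the transition prior and the inference model are one-dimensional Gaussians (so each KL term is analytic) and uses a single reparameterized sample per timestep; `encoder`, `prior`, and `decoder_log_lik` are hypothetical callables standing in for neural networks.

```python
import math
import random

def kl_gaussian(mu_q, logvar_q, mu_p, logvar_p):
    """Analytic KL( N(mu_q, var_q) || N(mu_p, var_p) ) for scalar Gaussians."""
    var_q, var_p = math.exp(logvar_q), math.exp(logvar_p)
    return 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def sequence_elbo(xs, us, encoder, prior, decoder_log_lik, seed=0):
    """One-sample estimate of the sequence ELBO: a per-timestep sum of
    a reconstruction term and a KL between the inference model
    q(z_t | z_{t-1}, x_{1:t}, u_{1:t}) and the transition prior
    p(z_t | z_{t-1}, u_t). `encoder(z_prev, x_t, u_t)` and
    `prior(z_prev, u_t)` each return (mu, log_var)."""
    rng = random.Random(seed)
    elbo, z_prev = 0.0, 0.0
    for x_t, u_t in zip(xs, us):
        mu_q, logvar_q = encoder(z_prev, x_t, u_t)   # inference model
        mu_p, logvar_p = prior(z_prev, u_t)          # transition prior
        # Reparameterized sample z_t ~ q, carried forward as z_{t-1}
        z_t = mu_q + math.exp(0.5 * logvar_q) * rng.gauss(0.0, 1.0)
        elbo += decoder_log_lik(x_t, z_t) - kl_gaussian(mu_q, logvar_q, mu_p, logvar_p)
        z_prev = z_t
    return elbo
```

In a real DVAE the three callables are neural networks trained jointly by gradient ascent on this quantity (averaged over sequences), but the accumulation over timesteps is exactly the sum in the final ELBO above.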

## Resources

- Original VAE paper by Kingma et al.
- Tutorial on VAEs by Kingma and Welling.
- A review of DVAEs by Girin et al. Super useful, and includes a much more general exposition of DVAEs, e.g., covering non-Markovian and non-causal models. I followed much of their development for this post.
- Danijar Hafner’s super cool work on world models is what originally took me down this rabbit hole.