Dynamical variational autoencoders (DVAEs) are a generalization of variational autoencoders (VAEs) for the case where the data is correlated through a hidden dynamical system. DVAEs use neural networks to learn a representation of a latent (a.k.a. hidden) state that evolves as a function of inputs and time.

In this post I go over the derivation of the loss function used to train a DVAE. I’ll start by deriving the loss for a regular VAE, introduce state-space models as a way of describing dynamical systems, and then derive the loss for a DVAE.

## Variational Autoencoders (VAEs)

Given a set of observations $\mathbf{x} = \{x^{(i)}\}_{i=1}^{N}$, we make two assumptions:

- Each observation $x^{(i)}$ is independently sampled from an unknown distribution $p(x)$.
- Each observation has a probabilistic dependence on an unobserved variable (a.k.a. latent or hidden variable) $z^{(i)}$ that contains information about the underlying structure of $x^{(i)}$. Therefore, we can also write the conditional distribution $p(x \mid z)$.

We wish to learn an approximation $p_\theta(x)$ of the underlying process that generated $\mathbf{x}$, where $\theta$ denotes the parameters of the approximation.

One way of learning $\theta$ is to maximize the log-likelihood of the data:

$$\theta^* = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta\left(x^{(i)}\right). \tag{1}$$

Unfortunately, directly maximizing Equation 1 is typically intractable.
To see why, consider the likelihood of a single datum $x^{(i)}$:

$$p_\theta\left(x^{(i)}\right) = \int p_\theta\left(x^{(i)} \mid z\right) p(z) \, dz. \tag{2}$$

Computing this integral requires marginalizing over all possible values of the latent variable $z$, which has no closed form for all but the simplest models and is prohibitively expensive to approximate numerically when $z$ is high-dimensional.
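To see the marginalization in Equation 2 concretely, here is a toy sketch where the latent is *discrete* with only two values, so the integral collapses to a two-term sum. All the numbers here (the fair-coin prior, the unit-variance Gaussian likelihoods) are hypothetical illustration choices; the point is that a continuous, high-dimensional $z$ admits no such shortcut.

```python
import math

# Hypothetical toy model with a discrete latent z in {0, 1}:
# p(z) is a fair coin, and p(x | z) is a unit-variance Gaussian
# whose mean depends on z.
prior = {0: 0.5, 1: 0.5}
means = {0: -1.0, 1: +1.0}

def gauss_pdf(x, mu):
    """Density of N(mu, 1) evaluated at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def marginal_likelihood(x):
    """p(x) = sum_z p(x | z) p(z) -- the discrete analogue of Equation 2."""
    return sum(gauss_pdf(x, means[z]) * p_z for z, p_z in prior.items())
```

With two latent values the sum is trivial; with a continuous latent vector the analogous computation is an intractable integral, which is exactly why we need the ELBO below.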

## The Evidence Lower Bound (ELBO) Loss

Fortunately, we can get around this computation by instead maximizing a lower bound on the data likelihood.
We begin by noting that if $q_\phi(z \mid x)$ is any distribution over $z$ (with parameters $\phi$), then Jensen's inequality gives

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right), \tag{3}$$

where the second term in Equation 3 is the Kullback-Leibler (KL) divergence between the approximated posterior and the prior on $z$. The right-hand side of Equation 3 is the evidence lower bound (ELBO).

In the literature on VAEs, the posterior distribution $q_\phi(z \mid x)$ is called the encoder and the likelihood $p_\theta(x \mid z)$ is called the decoder; both are typically implemented as neural networks with parameters $\phi$ and $\theta$, respectively.

To understand the explicit relationship between the ELBO and the log-likelihood, we can decompose the evidence as

$$\log p_\theta(x) = \mathrm{ELBO}(\theta, \phi; x) + D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right). \tag{4}$$

By re-arranging Equation 4, and noting that the second term is non-negative and only zero when the approximated posterior equals the true posterior, we can see that maximizing the ELBO implies maximizing log-likelihood and minimizing the KL divergence between the approximated and true posteriors.

Therefore, the ELBO is an appropriate surrogate objective for the intractable log-likelihood in Equation 1.
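To make the ELBO of Equation 3 concrete, here is a minimal numerical sketch for a one-dimensional model with a standard-normal prior and a Gaussian encoder. The KL term is analytic, and the reconstruction term is estimated by Monte Carlo with the reparameterization trick. The toy decoder and all constants are hypothetical choices for illustration, not part of the original derivation.

```python
import math
import random

def kl_gaussian_standard_normal(mu, log_var):
    """Analytic KL( N(mu, sigma^2) || N(0, 1) ) -- the second term of the ELBO
    for a Gaussian encoder and a standard-normal prior."""
    return 0.5 * (math.exp(log_var) + mu ** 2 - 1.0 - log_var)

def elbo_estimate(x, mu, log_var, decoder_log_lik, n_samples=1000, seed=0):
    """Monte Carlo estimate of the ELBO in Equation 3:
        E_q[log p(x|z)] - KL(q(z|x) || p(z)),
    sampling z via the reparameterization z = mu + sigma * eps."""
    rng = random.Random(seed)
    sigma = math.exp(0.5 * log_var)
    recon = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps              # reparameterized sample from q(z|x)
        recon += decoder_log_lik(x, z)
    recon /= n_samples
    return recon - kl_gaussian_standard_normal(mu, log_var)

def toy_decoder_log_lik(x, z):
    """Hypothetical decoder p(x|z) = N(x; z, 1), so
    log p(x|z) = -0.5*(x - z)^2 - 0.5*log(2*pi)."""
    return -0.5 * (x - z) ** 2 - 0.5 * math.log(2.0 * math.pi)
```

A nice sanity check on Equation 4: for this toy model the true posterior at $x=0$ is $\mathcal{N}(0, 1/2)$, and plugging it in as the encoder makes the KL gap vanish, so the ELBO estimate matches $\log p(x)$ up to Monte Carlo noise.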

## State-Space Models (SSMs)

So far we’ve talked about how to represent an input in a lower-dimensional space. If we have a set of images, like in the MNIST case, we can learn to represent features in the digits in a lower dimensional space. This makes sense: they’re all handwritten digits. But what if we know that the observations result from a dynamical system, such as a pendulum, some kind of flow field, or even a videogame? In this case, it might be useful to bake the observations’ dynamic nature into the latent space we’re trying to learn. Therefore, we need to modify the vanilla VAE, since it’s designed for uncorrelated data.

To get started, it is useful to review state-space models (SSMs), a mathematical model of physical dynamical systems used in time-series analysis, control theory, signal processing, neuroscience, and many other fields. We focus on the discrete-time, continuous-valued SSMs defined by the following equations:

$$z_t = f(z_{t-1}, u_t) + v_t, \tag{5}$$

$$v_t \sim p_{\theta^z}(v_t), \tag{6}$$

$$x_t = g(z_t) + w_t, \tag{7}$$

$$w_t \sim p_{\theta^x}(w_t), \tag{8}$$

where $z_t$ is the latent state, $u_t$ is a control input, $x_t$ is the observation, and $v_t$ and $w_t$ are the process and observation noise, respectively. Equations 5 and 6 define the state transition dynamics, and Equations 7 and 8 define the observation model.
The distributions in Equations 6 and 8 are parameterized by $\theta^z$ and $\theta^x$, respectively; we write $\theta = \{\theta^z, \theta^x\}$ for the full set of parameters.

Regarding notation: subscripts index time, and superscripts denote the corresponding variable.
In the following derivation we assume access to the observation sequence $x_{1:T}$ and the control sequence $u_{1:T}$. It will also be convenient to express the SSM in its equivalent probabilistic form:

$$z_t \sim p_{\theta^z}(z_t \mid z_{t-1}, u_t), \qquad x_t \sim p_{\theta^x}(x_t \mid z_t). \tag{9}$$
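To make the SSM structure concrete, here is a minimal simulation of a scalar linear-Gaussian SSM: the transition $f$ and observation map $g$ are linear, and the noise terms are Gaussian. The coefficients, noise scales, and sinusoidal control input are hypothetical choices for illustration, not something prescribed by the model class.

```python
import math
import random

def simulate_ssm(T, a=0.9, b=0.5, c=1.0, q_std=0.1, r_std=0.2, seed=0):
    """Sample a trajectory from a scalar linear-Gaussian SSM:
        z_t = a * z_{t-1} + b * u_t + v_t,  v_t ~ N(0, q_std^2)   (state, cf. Eqs. 5-6)
        x_t = c * z_t + w_t,                w_t ~ N(0, r_std^2)   (observation, cf. Eqs. 7-8)
    The controls u_t are a fixed sinusoid here, purely for illustration."""
    rng = random.Random(seed)
    z = 0.0
    states, observations, controls = [], [], []
    for t in range(T):
        u = math.sin(0.1 * t)                        # known control input u_t
        z = a * z + b * u + rng.gauss(0.0, q_std)    # state transition
        x = c * z + rng.gauss(0.0, r_std)            # noisy observation of the state
        controls.append(u)
        states.append(z)
        observations.append(x)
    return states, observations, controls
```

In practice only `observations` (and `controls`) would be available to the learner; the whole point of the DVAE is to recover a useful latent state without ever seeing `states`.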

## The ELBO for Dynamical VAEs

Now let’s approximate the distribution that generated the entire data sequence, not just individual data points.
This is because the data was generated from a dynamical system and is correlated over time: e.g., if the data $x_{1:T}$ is a sequence of video frames of a swinging pendulum, each frame is strongly correlated with the frames that precede it.

As one would expect, we run into the same issue as before: maximizing the likelihood $p_\theta(x_{1:T} \mid u_{1:T})$ directly is intractable, because it requires marginalizing over the entire latent sequence $z_{1:T}$.

Fortunately, we can get around this with the same approach as before: learn a conditioned likelihood $p_\theta(x_{1:T} \mid z_{1:T}, u_{1:T})$ together with an approximate posterior $q_\phi(z_{1:T} \mid x_{1:T}, u_{1:T})$, and maximize a sequence-level ELBO.

We derive this new ELBO by first noting that the dynamical system defined in Equation 9 is causal and Markovian.
It is a causal system because the distributions for observations and latent variables only depend on their values at previous time steps.
And it is a Markovian system because the transition dynamics only depend on the previous state.
Therefore,

$$p_\theta(x_t, z_t \mid x_{1:t-1}, z_{1:t-1}, u_{1:t}) = p_{\theta^x}(x_t \mid z_t) \, p_{\theta^z}(z_t \mid z_{t-1}, u_t). \tag{10}$$

We can also factorize the joint distribution of the observed and latent sequences as a product of conditional distributions for every timestep, i.e.,

$$p_\theta(x_{1:T}, z_{1:T} \mid u_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t, z_t \mid x_{1:t-1}, z_{1:t-1}, u_{1:t}). \tag{11}$$

Next, we write the ELBO in terms of the observation, latent, and control sequences, and then substitute the factorizations in Equations 10 and 11:

$$\mathcal{L}(\theta, \phi; x_{1:T}, u_{1:T}) = \mathbb{E}_{q_\phi(z_{1:T} \mid x_{1:T}, u_{1:T})}\left[\log p_\theta(x_{1:T}, z_{1:T} \mid u_{1:T}) - \log q_\phi(z_{1:T} \mid x_{1:T}, u_{1:T})\right]. \tag{12}$$

Now, if we let the approximate posterior inherit the causal, Markovian structure of the generative model, i.e., $q_\phi(z_{1:T} \mid x_{1:T}, u_{1:T}) = \prod_{t=1}^{T} q_\phi(z_t \mid z_{t-1}, x_{1:t}, u_{1:t})$, the ELBO decomposes into a per-timestep sum of a reconstruction term and a KL term:

$$\mathcal{L}(\theta, \phi; x_{1:T}, u_{1:T}) = \sum_{t=1}^{T} \mathbb{E}_{q_\phi(z_{1:t})}\left[\log p_{\theta^x}(x_t \mid z_t)\right] - \mathbb{E}_{q_\phi(z_{1:t-1})}\left[D_{\mathrm{KL}}\left(q_\phi(z_t \mid z_{t-1}, x_{1:t}, u_{1:t}) \,\|\, p_{\theta^z}(z_t \mid z_{t-1}, u_t)\right)\right], \tag{13}$$

where $q_\phi(z_{1:t})$ is shorthand for the marginal of the approximate posterior over the first $t$ latent variables. This is the loss used to train a DVAE.
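The per-timestep structure of this sequence ELBO can be sketched in code. The sketch below assumes, purely for illustration, that both the transition prior and the inference model are one-dimensional Gaussians (so each KL term is analytic) and uses a single reparameterized sample per timestep; `encoder`, `prior`, and `decoder_log_lik` are hypothetical callables standing in for neural networks.

```python
import math
import random

def kl_gaussian(mu_q, logvar_q, mu_p, logvar_p):
    """Analytic KL( N(mu_q, var_q) || N(mu_p, var_p) ) for scalar Gaussians."""
    var_q, var_p = math.exp(logvar_q), math.exp(logvar_p)
    return 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def sequence_elbo(xs, us, encoder, prior, decoder_log_lik, seed=0):
    """One-sample estimate of the sequence ELBO: a per-timestep sum of
    a reconstruction term and a KL between the inference model
    q(z_t | z_{t-1}, x_{1:t}, u_{1:t}) and the transition prior
    p(z_t | z_{t-1}, u_t). `encoder(z_prev, x_t, u_t)` and
    `prior(z_prev, u_t)` each return (mu, log_var)."""
    rng = random.Random(seed)
    elbo, z_prev = 0.0, 0.0
    for x_t, u_t in zip(xs, us):
        mu_q, logvar_q = encoder(z_prev, x_t, u_t)   # inference model
        mu_p, logvar_p = prior(z_prev, u_t)          # transition prior
        # Reparameterized sample z_t ~ q, carried forward as z_{t-1}
        z_t = mu_q + math.exp(0.5 * logvar_q) * rng.gauss(0.0, 1.0)
        elbo += decoder_log_lik(x_t, z_t) - kl_gaussian(mu_q, logvar_q, mu_p, logvar_p)
        z_prev = z_t
    return elbo
```

In a real DVAE the three callables are neural networks trained jointly by gradient ascent on this quantity (averaged over sequences), but the accumulation over timesteps is exactly the sum in the final ELBO above.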

## Resources

- Original VAE paper by Kingma et al.
- Tutorial on VAEs by Kingma and Welling.
- A review of DVAEs by Girin et al. Super useful, and includes a much more general exposition of DVAEs, e.g., covering non-Markovian and non-causal models. I followed much of their development for this post.
- Danijar Hafner’s super cool work on world models is what originally took me down this rabbit hole.