Notes on PILCO (Deisenroth & Rasmussen, 2011) and Deep PILCO (Gal et al., 2016), two seminal model-based RL papers.


Model-based RL and PILCO

Model-based RL learns the dynamics model for the underlying Markov decision process (MDP). Since it learns to model the environment dynamics, it is, in theory, a more data-efficient way of finding good policies because it allows the algorithm to cheaply generate “synthetic” or “imagined” trajectories from which it can learn. One way of finding a policy under this regime is probabilistic inference for learning control (PILCO) (Deisenroth & Rasmussen, 2011), a policy search method that learns a probabilistic dynamics model.

PILCO uses a Gaussian process (GP) to learn the dynamics model. Having a (learned) model allows for policy search based on analytic policy gradients. This is unlike model-free approaches, which approximate the gradient by treating the dynamics as a black box that we can sample from but cannot differentiate through (check out this policy gradient derivation to see what I mean). In PILCO, the GP yields one-step predictions of the difference between the next state and the current one. The GP’s kernel hyperparameters are trained using evidence maximization, and inference amounts to conditioning on the training data and evaluating the resulting posterior at a test input (which has an analytic form, since we are dealing with Gaussians). A detailed description of GPs, how to train them, and how to perform inference with them is found here.
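To make the training and inference steps concrete, here is a minimal sketch of a PILCO-style one-step dynamics GP. It uses scikit-learn’s GaussianProcessRegressor purely as an assumed stand-in (the paper uses its own GP machinery), and the rollout data, dimensions, and kernel choice are made up for illustration:

```python
# Minimal sketch of a PILCO-style GP dynamics model (illustrative only).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy rollout data: states x_t, actions u_t, next states x_{t+1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))          # states (dim 2)
U = rng.normal(size=(50, 1))          # actions (dim 1)
X_next = X + 0.1 * U + 0.01 * rng.normal(size=X.shape)

inputs = np.hstack([X, U])            # GP inputs are state-action pairs
targets = X_next - X                  # GP targets are state differences

# Squared-exponential kernel + noise; the hyperparameters are fit by
# evidence (marginal likelihood) maximization inside .fit().
kernel = RBF(length_scale=np.ones(inputs.shape[1])) + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(inputs, targets[:, 0])         # one GP per output dimension in practice

# One-step prediction: posterior mean and standard deviation of the difference.
mean, std = gp.predict(np.hstack([X[:1], U[:1]]), return_std=True)
```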

Uncertainty Propagation and Model Bias in PILCO

A key feature of PILCO is that it propagates model uncertainty over time. This has an important consequence: by propagating uncertainty, we can reduce model bias during policy optimization. Model bias refers to the problem of acting on an imperfect model as if it were perfect. This is problematic when optimizing the policy because, without some notion of model uncertainty, the policy may be optimized against dynamics that do not exist. To see why a probabilistic dynamics model reduces model bias, consider the following two scenarios.

First, consider low uncertainty on the predicted state (due to, e.g., using a deterministic dynamics model). The predicted state will most likely lie in a narrow region of the state space. The expected cost is the average of the cost at each of these points (weighted by the state probability). Since most of the probability is concentrated on a small set of points, the expected cost is essentially the cost of just a few points. If the distribution moves around (due to changes in the model parameters), changes in the expected cost will closely match changes in the cost landscape. Therefore, if the gradient of the cost with respect to the state is large, so is the gradient of the expected cost with respect to the model parameters.

Second, consider high uncertainty on the predicted state. The predicted state may lie in a wide region of the state space. Since the probability mass is spread over a large set of points, the expected cost is an average over many points on the cost landscape. If the distribution moves around (due to changes in the model parameters), the expected cost may not change very much, because we are averaging over many points on the cost landscape. Therefore, even if the gradient of the cost with respect to the state is large, the gradient of the expected cost with respect to the model parameters will most likely be small.

Therefore, using a probabilistic dynamics model for policy optimization reduces model bias (compared to a deterministic model). Because gradient information is attenuated for uncertain states, the policy is effectively optimized only on the state transitions the model is confident about. This is unlike a deterministic model, which may use gradient information from junk predictions.
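To make the two scenarios concrete, here is a small numerical sketch (not from the paper; the saturating cost, the one-dimensional state, and the specific numbers are my own assumptions). The gradient of the expected cost with respect to the predicted mean is large when the predictive distribution is narrow and nearly vanishes when it is wide:

```python
# Sketch: how predictive uncertainty washes out the expected-cost gradient.
import numpy as np

def cost(x):
    # Saturating cost centered at the target state x = 0 (the exact shape
    # is an assumption made for illustration).
    return 1.0 - np.exp(-x**2 / 0.05)

def expected_cost(mean, std, n_samples=100_000, seed=0):
    # Monte Carlo estimate of E[cost(x)] under x ~ N(mean, std^2).
    rng = np.random.default_rng(seed)
    samples = rng.normal(mean, std, size=n_samples)
    return cost(samples).mean()

def grad_wrt_mean(mean, std, eps=1e-3):
    # Finite-difference gradient of the expected cost w.r.t. the mean,
    # which stands in for "gradient w.r.t. model parameters" in the text.
    return (expected_cost(mean + eps, std) - expected_cost(mean - eps, std)) / (2 * eps)

mean = 0.3  # predicted state slightly off the target
print(grad_wrt_mean(mean, std=0.01))  # low uncertainty: large gradient
print(grad_wrt_mean(mean, std=2.0))   # high uncertainty: gradient close to zero
```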

One could argue that this increases data efficiency because policy rollouts are not wasted on actions selected due to wrong state predictions. On the other hand, one could argue that if what we are after is an accurate model, then by not exploring the areas of high uncertainty we will actually take longer to learn one, because we are exploiting the current learned model rather than exploring for a better one. Two relevant papers here are (Sukhija et al., 2023) and (Curi et al., 2020), which directly exploit model uncertainty to guide exploration and thereby improve model performance.

Deep PILCO

The problem with GPs is that the kernel matrix inverted during inference is n × n for n training points, so its size grows quadratically with the training data; GP inference therefore scales cubically, since general matrix inversion algorithms have a computational complexity of O(n³) (source). To address this, (Gal et al., 2016) proposes Deep PILCO, a model-based RL algorithm that uses Bayesian neural networks (BNNs) instead of GPs to represent the dynamics model. The advantage of BNNs is that the computational complexity for training scales linearly with the number of trials and the dimensionality of the observation space. BNNs represent model uncertainty by using a posterior distribution over the weights of the neural network, i.e., a distribution over weight values conditioned on the training data. A key difficulty of using BNNs is that computing the true posterior is intractable. Therefore, Gal et al. propose a variational inference approach that minimizes the KL divergence between a tractable family of distributions and the true posterior. Specifically, they leverage the fact that dropout training can be framed as approximate Bayesian inference (Gal & Ghahramani, 2016), which provides uncertainty over both the weights and the predictions.
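Here is a minimal Monte Carlo dropout sketch, using PyTorch as an assumed stand-in for the paper’s actual implementation (the architecture, dropout rate, and dimensions are illustrative only). Keeping dropout active at prediction time and averaging over stochastic forward passes gives an approximate predictive mean and variance:

```python
# MC-dropout sketch of a BNN dynamics model (illustrative, not the paper's code).
import torch
import torch.nn as nn

class DropoutDynamicsModel(nn.Module):
    """Dropout MLP that maps (state, action) to a predicted state difference."""
    def __init__(self, state_dim, action_dim, hidden=200, p=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

model = DropoutDynamicsModel(state_dim=2, action_dim=1)
state, action = torch.randn(1, 2), torch.randn(1, 1)

# Monte Carlo dropout: keep dropout ON (train mode) and average over
# stochastic forward passes to estimate the predictive mean and variance.
model.train()
with torch.no_grad():
    preds = torch.stack([model(state, action) for _ in range(50)])
mean, var = preds.mean(dim=0), preds.var(dim=0)
```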

Another key difficulty is how to propagate uncertainty over time. Regular PILCO can do this with relative ease because, given a distribution p(x_t), computing the distribution of the next state p(x_{t+1}) is analytically tractable with a GP. Concretely, in PILCO, given a GP with a squared exponential kernel and a Gaussian input distribution x_t ~ N(μ_t, Σ_t), we can write exact, analytic equations for the output mean and variance. This is known as moment matching and is an approximation because we only match the mean and the variance of the potentially non-Gaussian p(x_{t+1}). Doing this with a BNN is not as straightforward because, in general, there is no analytical expression for the new mean and variance. Instead, Deep PILCO uses particle methods to approximate the predictive distribution. Concretely, in Deep PILCO, we approximate the moments of p(x_{t+1}) by the following steps (a minimal sketch follows the list):

  1. Sampling particles from the input Gaussian N(μ_t, Σ_t).
  2. Propagating each particle through the BNN.
  3. Fitting a new Gaussian distribution by calculating the mean and covariance of the propagated particles.
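A minimal sketch of this one-step propagation, again assuming a small dropout network in PyTorch (the network, dimensions, and particle count are illustrative, not the paper’s):

```python
# Particle-based uncertainty propagation through a dropout dynamics network.
import torch
import torch.nn as nn

# An assumed dropout dynamics network (same idea as the MC-dropout snippet
# above); it maps a (state, action) pair to a predicted state difference.
state_dim, action_dim = 2, 1
bnn = nn.Sequential(
    nn.Linear(state_dim + action_dim, 200), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(200, state_dim),
)

def propagate(bnn, mu, sigma, action, n_particles=30):
    """One-step uncertainty propagation: Gaussian in, Gaussian out."""
    # 1. Sample particles from the input Gaussian N(mu, sigma).
    dist = torch.distributions.MultivariateNormal(mu, covariance_matrix=sigma)
    particles = dist.sample((n_particles,))               # (P, state_dim)

    # 2. Propagate each particle through the BNN. Dropout stays active
    #    (train mode), so each forward pass also samples the weights.
    bnn.train()
    with torch.no_grad():
        inputs = torch.cat([particles, action.expand(n_particles, -1)], dim=-1)
        next_particles = particles + bnn(inputs)

    # 3. Fit a new Gaussian via the particles' mean and covariance.
    new_mu = next_particles.mean(dim=0)
    centered = next_particles - new_mu
    new_sigma = centered.T @ centered / (n_particles - 1)
    return new_mu, new_sigma

mu1, sigma1 = propagate(bnn, mu=torch.zeros(2), sigma=0.01 * torch.eye(2),
                        action=torch.randn(1, 1))
```

One detail glossed over here: as far as I understand, the paper samples one dropout mask (i.e., one weight realization) per particle and keeps it fixed across the rollout horizon, whereas the single-step sketch above lets the mask vary on every forward pass.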

Final Thoughts

  • What does this mean for the sampling-based exploration algorithm I’m working on? One conclusion, of course, is that model bias is definitely a problem if I only use a single realization during planning. As painful as it might be, it’s probably a good idea to reason about model uncertainty during planning, either by sampling or by explicitly modelling uncertainty over time.
  • Can this uncertainty-aware optimization approach be used for language model inference? And can this be used to optimize attention? Or the information we put in the context window?
  • How can we be aware that there is a better reward? In Mountain Car, for example, the agent has no way of knowing a better reward exists unless it has gotten a taste of it. So, if exploration happens to be lucky enough to discover that a higher reward is possible, it’s all just luck. This raises the question: when doing uncertainty-guided exploration, how do we know which uncertain areas of the state space are worth exploring? This is essentially a question about the value of information. How does this apply to language models? What if we say something like: my current response resulted in reward R, but from pre-training knowledge I know a reward of R+1 is possible, so exploration is worth it. I should be more precise here, but this is an interesting idea.

Bibliography

Curi, S., Berkenkamp, F., & Krause, A. (2020). Efficient model-based reinforcement learning through optimistic policy search and planning. Advances in Neural Information Processing Systems, 33, 14156–14170.
Deisenroth, M., & Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. Proceedings of the 28th International Conference on Machine Learning (ICML-11), 465–472.
Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. International Conference on Machine Learning, 1050–1059.
Gal, Y., McAllister, R., & Rasmussen, C. E. (2016). Improving PILCO with Bayesian neural network dynamics models. Data-Efficient Machine Learning Workshop, ICML, 4(34), 25.
Sukhija, B., Treven, L., Sancaktar, C., Blaes, S., Coros, S., & Krause, A. (2023). Optimistic active exploration of dynamical systems. Advances in Neural Information Processing Systems, 36, 38122–38153.