Notes on PILCO (Deisenroth & Rasmussen, 2011) and deep PILCO (Gal et al., 2016), two seminal model-based RL papers.
Model-based RL and PILCO
Model-based RL learns a model of the dynamics of the underlying Markov decision process (MDP). Because it models the environment dynamics, it is, in theory, a more data-efficient way of finding good policies: the algorithm can cheaply generate “synthetic” or “imagined” trajectories from which to learn. One way of finding a policy under this regime is probabilistic inference for learning control (PILCO) (Deisenroth & Rasmussen, 2011), a policy search method that learns a probabilistic dynamics model.
PILCO uses a Gaussian process (GP) to learn the dynamics model. Having a (learned) model allows for policy search based on analytic policy gradients. This is unlike model-free approaches, which approximate the gradient by treating the dynamics as a black box from which we can sample but cannot differentiate through (check out this policy gradient derivation to see what I mean). In PILCO, the GP yields one-step predictions of the difference between the next state and the current one. The GP’s kernel hyperparameters are trained by evidence maximization, and inference is performed by conditioning the trained distribution on a test input (which has an analytic form, since we are dealing with Gaussians). A detailed description of GPs, how to train them, and how to perform inference with them is found here.
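As a concrete (and hedged) illustration, here is a minimal sketch of a GP dynamics model in the spirit of PILCO, using scikit-learn rather than the authors’ code. The toy dynamics, dimensions, and kernel choice are assumptions; the point is that inputs are state–action pairs, targets are state differences, fitting maximizes the log marginal likelihood (the evidence), and predictions come back as Gaussians.

```python
# Hedged sketch (not the paper's code): a GP dynamics model in the spirit of PILCO.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(0)

# Toy 1-D dynamics data (made up): x_{t+1} = x_t + 0.1*u_t - 0.05*sin(x_t) + noise.
X = rng.uniform(-2, 2, size=(200, 2))                  # columns: [x_t, u_t]
delta = 0.1 * X[:, 1] - 0.05 * np.sin(X[:, 0]) + 0.01 * rng.standard_normal(200)

kernel = ConstantKernel() * RBF(length_scale=[1.0, 1.0]) + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X, delta)                                       # hyperparameters by evidence maximization

# One-step prediction: a Gaussian over the state difference at a test input.
x_test = np.array([[0.5, -1.0]])                       # [x_t, u_t]
mean_delta, std_delta = gp.predict(x_test, return_std=True)
print("predicted delta:", mean_delta[0], "+/-", std_delta[0])
print("predicted next state:", x_test[0, 0] + mean_delta[0])
```

PILCO trains one such GP per target (state) dimension and then propagates the resulting Gaussian predictions through time; the sketch only shows the one-step regression.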
Uncertainty Propagation and Model Bias in PILCO
An important feature of PILCO is that it propagates model uncertainty over time. This has an important consequence: by predicting uncertainty over time, we can reduce model bias during policy optimization. Model bias refers to the problem of acting on an imperfect model as if it were perfect. This can be problematic when optimizing the policy because, without some notion of model uncertainty, the policy may be optimized for a dynamics model that does not exist. To see why a probabilistic dynamics model reduces model bias, consider the following two scenarios.
First, consider low uncertainty on the predicted state (due to, e.g., using a deterministic dynamics model). The predicted state will most likely fall in a narrow region of the state space. The expected cost is the average of the cost at each possible state, weighted by the state probability; since most of the probability is concentrated on a small set of points, the expected cost is essentially the cost of a few points. If the distribution moves around (due to changes in the policy parameters), changes in the expected cost will closely match changes in the cost landscape. Therefore, if the gradient of the cost with respect to the state is high, so is the gradient of the expected cost with respect to the policy parameters.
Second, consider high uncertainty on the predicted state. The predicted state may fall anywhere in a wide region of the state space. Since the probability is spread over a large set of points, the expected cost is an average over many points on the cost landscape. If the distribution moves around (due to changes in the policy parameters), the expected cost may not change very much, because we are averaging over many points on the cost landscape. Therefore, even if the gradient of the cost with respect to the state is high, the gradient of the expected cost with respect to the policy parameters will most likely be low.
Therefore, using a probabilistic dynamics model for policy optimization reduces model bias compared to a deterministic model. By attenuating the gradient for states the model is uncertain about, the policy is optimized mostly on the state transitions the model is highly confident in. A deterministic model, in contrast, happily uses gradient information from junk predictions.
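A small numerical illustration of the two scenarios above (the cost landscape and all numbers are made up for the sake of the example): the same small shift of the predicted-state mean changes the expected cost a lot when the predictive distribution is narrow, and hardly at all when it is wide.

```python
import numpy as np

def expected_cost(mean, std, cost_fn, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[c(x)] for x ~ N(mean, std^2)."""
    rng = np.random.default_rng(seed)          # same seed => common random numbers across calls
    x = rng.normal(mean, std, size=n_samples)
    return cost_fn(x).mean()

def cost(x):
    # Made-up, rapidly varying cost landscape.
    return np.sin(5.0 * x) + 0.1 * x**2

eps = 1e-2                                     # small shift of the predicted-state mean
for std in (0.05, 2.0):                        # narrow vs. wide predicted-state distribution
    grad = (expected_cost(eps, std, cost) - expected_cost(-eps, std, cost)) / (2 * eps)
    print(f"std = {std:4.2f}   dE[c]/dmean ~ {grad:+.3f}")
# Narrow: the gradient tracks the local slope of the cost (about 5*cos(0) = 5).
# Wide: averaging over many points flattens the landscape, so the gradient is tiny.
```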
One could argue that this increases data efficiency because policy rollouts are not wasted on actions selected due to wrong state predictions. On the other hand, one could argue that if what we are after is an accurate model, then by not exploring the areas of high uncertainty we will actually take longer to learn one, because we are exploiting the current learned model rather than exploring for a better one. Two relevant papers here are (Sukhija et al., 2023) and (Curi et al., 2020), which directly exploit model uncertainty to guide exploration and thereby improve the learned model.
Deep PILCO
The problem with GPs is that the matrices inverted during inference grow quadratically (in number of entries) with the amount of training data; GP inference therefore scales cubically, since general algorithms for inverting an $n \times n$ matrix run in $\mathcal{O}(n^3)$ time. Deep PILCO addresses this by replacing the GP with a Bayesian neural network (BNN) dynamics model, trained with dropout as approximate inference, which scales much more gracefully with data.
Another key difficulty is how to propagate uncertainty over time.
Regular PILCO can do this with relative ease: given a Gaussian distribution over the current state, the moments of the GP’s predictive distribution can be computed analytically (moment matching), so the state at the next time step is again approximated by a Gaussian. A BNN admits no such closed-form propagation, so deep PILCO approximates it with particles (see the sketch after this list):
- Sampling $K$ particles from the input Gaussian $\mathcal{N}(\mu_t, \Sigma_t)$.
- Propagating each particle through the BNN.
- Fitting a new Gaussian distribution $\mathcal{N}(\mu_{t+1}, \Sigma_{t+1})$ by calculating the mean and covariance of the propagated particles.
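A minimal sketch of this particle scheme, under some assumptions: a small dropout MLP stands in for the BNN, dropout is kept active at prediction time in the spirit of MC dropout, and gradient bookkeeping is omitted.

```python
# Hedged sketch of one step of particle-based uncertainty propagation (not the authors' code).
import torch
import torch.nn as nn

torch.manual_seed(0)

state_dim = 2
bnn = nn.Sequential(                      # stand-in for the BNN: a small dropout MLP
    nn.Linear(state_dim, 64), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(64, state_dim),
)
bnn.train()                               # keep dropout stochastic at prediction time

def propagate(mu, sigma, n_particles=50):
    """Propagate N(mu, sigma) through the BNN and refit a Gaussian."""
    dist = torch.distributions.MultivariateNormal(mu, sigma)
    particles = dist.sample((n_particles,))             # K particles from the input Gaussian
    with torch.no_grad():                                # gradients omitted in this sketch
        next_particles = particles + bnn(particles)      # network predicts the state difference
    mu_next = next_particles.mean(dim=0)                 # mean of propagated particles
    sigma_next = torch.cov(next_particles.T)             # covariance of propagated particles
    return mu_next, sigma_next

mu0 = torch.zeros(state_dim)
sigma0 = 0.05 * torch.eye(state_dim)
mu1, sigma1 = propagate(mu0, sigma0)
print(mu1, sigma1)
```

In the full algorithm, each particle should correspond to a single sampled dynamics function (i.e., a dropout mask fixed per particle across time steps); this one-step sketch resamples masks on every forward pass for simplicity.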
Final Thoughts
- What does this mean for the sampling-based exploration algorithm I’m working on? One conclusion, of course, is that model bias is definitely a problem if I only use a single realization during planning. As painful as it might be, it’s probably a good idea to reason about model uncertainty during planning, either by sampling or by explicitly modelling uncertainty over time.
- Can this uncertainty-aware optimization approach be used for language model inference? And can this be used to optimize attention? Or the information we put in the context window?
- How can we be aware that there is a better cost? In Mountain Car, for example, the agent has no way of knowing a better reward exists unless it has gotten a taste of it. So, if exploration happens to be lucky enough to discover that a higher reward is possible, it’s all just luck. This raises the question: when doing uncertainty-guided exploration, how do we know which uncertain areas of the state space are worth exploring? Value of information. How does this apply to language models? What if we say something like: my current response resulted in reward R, but due to pre-training knowledge, I know a reward of R+1 is possible, so exploration is worth it. I should be more precise here, but this is an interesting idea.