As I dug into the theoretical foundations of active uncertainty reduction in dynamical systems (like robots),
I ran into the field of dual control.
Unfortunately, most of the literature seems to agree that it’s very hard.
In this blog post I explain exactly why it’s hard, and then write about what this means for my research.
We seek to solve the following problem: drive a dynamical system to a state of interest while minimizing cost.
Let’s start by defining the state as $x_t$ and the controls as $u_t$.
As is standard in optimal control and reinforcement learning, we define a cost given by the quadratic function
$$c(x_t, u_t) = \left(x_t - r_t\right)^\top Q \left(x_t - r_t\right) + u_t^\top R\, u_t,$$
where $r_t$ denotes a target or reference trajectory, and $Q$, $R$ are matrices that define the state and control costs.
Let’s start with the simplest possible dynamical system: a linear, scalar system with noiseless observations given by
$$x_{t+1} = a x_t + b u_t + \epsilon_t,$$
where $\epsilon_t \sim \mathcal{N}(0, \sigma_\epsilon^2)$ is process noise.
Assuming $a$ and $b$ are known, and $b \neq 0$, the optimal control that drives $x_{t+1}$ to zero (in expectation) in one step is given by
$$u_t = -\frac{a}{b} x_t.$$
This can be easily found by substituting the dynamics into the expected cost¹, computing the cost’s derivative with respect to $u_t$, setting it to zero, and solving for $u_t$.
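If you want to check that derivation mechanically, here’s a quick sympy sketch for the scalar system above (the symbol names are just placeholders, and the expected cost is written out under the same assumptions):

```python
import sympy as sp

# Symbols for the scalar dynamics x_{t+1} = a*x_t + b*u_t + eps_t,
# with eps_t ~ N(0, sigma_eps^2) independent of the state and control.
a, b, x, u, sigma_eps = sp.symbols("a b x_t u_t sigma_eps", real=True)

# Expected one-step cost E[x_{t+1}^2] with a and b known: the cross term
# with the zero-mean noise vanishes, leaving (a*x + b*u)^2 + sigma_eps^2.
expected_cost = (a * x + b * u) ** 2 + sigma_eps ** 2

# Minimize over u_t: set the derivative to zero and solve.
u_star = sp.solve(sp.diff(expected_cost, u), u)[0]
print(u_star)  # -a*x_t/b
```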
Now let’s consider the case where we are uncertain about $b$.
Given a belief $b \sim \mathcal{N}(\tilde{b}_t, \sigma_{b,t}^2)$ at time $t$, a naïve solution is to replace $b$ with our current estimate of its mean, $\tilde{b}_t$.
This approach is known as certainty equivalence.
Unfortunately, it does not account for the uncertainty in the belief, i.e., how big $\sigma_{b,t}^2$ is, and often leads to poor performance.
Alternatively, we could minimize the expected cost
$$\mathbb{E}_{b, \epsilon_t}\left[x_{t+1}^2\right] = \left(a x_t + \tilde{b}_t u_t\right)^2 + u_t^2 \sigma_{b,t}^2 + \sigma_\epsilon^2.$$
This results in the optimal control
$$u_t = -\frac{a \tilde{b}_t x_t}{\tilde{b}_t^2 + \sigma_{b,t}^2}.$$
This is also known as “cautious” control, since its magnitude shrinks as the uncertainty $\sigma_{b,t}^2$ grows.
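To get a feel for the difference, here is a small numerical comparison of the two controllers under the running example (all of the constants are arbitrary):

```python
import numpy as np

def u_certainty_equivalent(x, a, b_mean):
    # Pretend the belief mean is the true parameter.
    return -a * x / b_mean

def u_cautious(x, a, b_mean, b_var):
    # Minimizer of E[x_{t+1}^2] under b ~ N(b_mean, b_var).
    return -a * b_mean * x / (b_mean ** 2 + b_var)

x, a, b_mean = 1.0, 1.0, 0.5
for b_var in [0.01, 0.25, 1.0, 10.0, 100.0]:
    print(f"var={b_var:6.2f}  ce={u_certainty_equivalent(x, a, b_mean):+.2f}  "
          f"cautious={u_cautious(x, a, b_mean, b_var):+.4f}")

# As b_var grows, the cautious control shrinks toward zero, while the
# certainty-equivalent control ignores the uncertainty entirely.
```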
Although this approach accounts for uncertainty, it can prevent learning (a future reduction in the uncertainty of $b$) if the uncertainty is too high.
To see how this happens, we compute the posterior on $b$ after observing $x_{t+1}$, i.e., $p(b \mid x_{t+1}, x_t, u_t)$.
We do so by invoking Bayes’ rule, noting that $x_{t+1} \sim \mathcal{N}(a x_t + b u_t, \sigma_\epsilon^2)$, and then doing a bunch of algebra.
The result is the following Gaussian:
$$b \mid x_{t+1}, x_t, u_t \sim \mathcal{N}\left(\tilde{b}_{t+1}, \sigma_{b,t+1}^2\right), \qquad
\tilde{b}_{t+1} = \frac{\sigma_\epsilon^2 \tilde{b}_t + \sigma_{b,t}^2 u_t \left(x_{t+1} - a x_t\right)}{\sigma_\epsilon^2 + \sigma_{b,t}^2 u_t^2}, \qquad
\sigma_{b,t+1}^2 = \frac{\sigma_{b,t}^2 \sigma_\epsilon^2}{\sigma_\epsilon^2 + \sigma_{b,t}^2 u_t^2}.$$
Notice that if $\sigma_{b,t}^2$ is large then the cautious control $u_t \approx 0$, implying $\sigma_{b,t+1}^2 \approx \sigma_{b,t}^2$, and no learning happens.
Also note how even for a simple scalar, linear system, the posterior update is already pretty gnarly.
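Here is that belief update in code, rolled forward with the cautious controller under a very uncertain prior; it makes the “no learning” failure mode easy to see (the constants are arbitrary):

```python
import numpy as np

def posterior_update(b_mean, b_var, x_t, u_t, x_next, a, noise_var):
    """Conjugate Gaussian update for b after observing x_{t+1} = a*x_t + b*u_t + eps."""
    y = x_next - a * x_t                 # the part of x_{t+1} explained by b*u_t + eps
    denom = noise_var + b_var * u_t ** 2
    new_mean = (noise_var * b_mean + b_var * u_t * y) / denom
    new_var = b_var * noise_var / denom
    return new_mean, new_var

rng = np.random.default_rng(0)
a, b_true, noise_var = 1.0, 0.5, 1.0
b_mean, b_var, x = 1.0, 25.0, 1.0        # very uncertain prior over b

for t in range(5):
    u = -a * b_mean * x / (b_mean ** 2 + b_var)   # cautious control
    x_next = a * x + b_true * u + rng.normal(0.0, np.sqrt(noise_var))
    b_mean, b_var = posterior_update(b_mean, b_var, x, u, x_next, a, noise_var)
    print(f"t={t}  u={u:+.4f}  posterior var={b_var:.2f}")
    x = x_next

# Because b_var is huge, the cautious u is tiny, so u^2 * b_var << noise_var and the
# posterior variance shrinks only slightly each step: the controller has, in effect,
# turned itself off.
```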
The fact that the controller doesn’t really account for how uncertainty will be reduced at future timesteps should not come as a surprise: we selected a controller that minimizes cost for a single step.
This is known as a myopic controller because it doesn’t consider the consequences of its actions over horizons longer than a single timestep.
Dual control was pioneered by Alexander Feldbaum in the 60’s.
A controller in an uncertain system is said to have a dual control effect if it takes actions to 1) simultaneously reduce the uncertainty and 2) drive the system towards the reference or goal trajectory.
As I described in the previous section, if we want a controller to have this effect, we must consider control over trajectories of horizon $T > 1$.
So now, let’s revisit the cost function for longer trajectories.
At every timestep $t$ we can define the expected cost of the remaining trajectory in the following recursive manner:
$$J_t(x_t) = \mathbb{E}\left[c(x_t, u_t) + J_{t+1}(x_{t+1})\right],$$
where the final term is defined by the terminal cost
$$J_T(x_T) = \mathbb{E}\left[c(x_T, u_T)\right].$$
The optimal control sequence can be found by solving the following minimization with dynamic programming:
$$u_{t:T}^* = \arg\min_{u_{t:T}} J_t(x_t).$$
For more details on how to solve problems of this form, see Section 8.3.1 here.
If we substitute our dynamical system for a horizon of $T = t + 2$, then $J_t$ is the nested expectation
$$J_t = \mathbb{E}_{b, \epsilon_t}\left[x_{t+1}^2 + \mathbb{E}_{b, \epsilon_{t+1}}\left[x_{t+2}^2 \mid x_{t+1}\right]\right].$$
Substituting $u_{t+1}$ with the myopic controller and writing the expectations a bit more explicitly, we get
$$J_t = \mathbb{E}_{b \sim \mathcal{N}(\tilde{b}_t, \sigma_{b,t}^2),\ \epsilon_t \sim \mathcal{N}(0, \sigma_\epsilon^2)}\left[x_{t+1}^2 + \mathbb{E}_{b \sim \mathcal{N}(\tilde{b}_{t+1}, \sigma_{b,t+1}^2),\ \epsilon_{t+1} \sim \mathcal{N}(0, \sigma_\epsilon^2)}\left[x_{t+2}^2\right]\right].$$
The solution to this problem has a dual control effect because solving for $u_t$ requires minimizing a cost that drives the system towards the reference trajectory while accounting for changes in future beliefs (expressed through the nested expectations).
The nested expectations already hint that solving for a control sequence with a dual effect may be tricky, but let’s understand exactly why this is hard.
Let’s consider the innermost expectation:
$$\mathbb{E}_{b \sim \mathcal{N}(\tilde{b}_{t+1}, \sigma_{b,t+1}^2),\ \epsilon_{t+1} \sim \mathcal{N}(0, \sigma_\epsilon^2)}\left[x_{t+2}^2\right].$$
For brevity, let
$$g = x_{t+2}^2 = \left(a x_{t+1} + b u_{t+1} + \epsilon_{t+1}\right)^2.$$
This is a joint expectation over two independent Gaussian distributions, $\mathcal{N}(\tilde{b}_{t+1}, \sigma_{b,t+1}^2)$ and $\mathcal{N}(0, \sigma_\epsilon^2)$.
And even though $\mathcal{N}(\tilde{b}_{t+1}, \sigma_{b,t+1}^2)$ has nasty expressions for its mean and variance (as shown by the posterior update above), it’s still a Gaussian, so depending on the argument of the expectation $g$, computing it should be relatively easy.
Unfortunately, $g$ is indeed very nasty.
Recall that $u_{t+1}$ is a nonlinear function of the belief parameters at time $t+1$, i.e., $\tilde{b}_{t+1}$ and $\sigma_{b,t+1}^2$.
However, as shown in the belief update above, both of these parameters are (also nonlinear) functions of the belief parameters at the previous timestep, i.e., $\tilde{b}_t$ and $\sigma_{b,t}^2$, as well as of $x_{t+1}$ and $u_t$.
This means that $g$ is a very nonlinear function of $u_t$ and the variables we are integrating over, with a form that implies the expectation is analytically intractable².
Therefore, the best we can do is to numerically approximate the expectation, e.g., by sampling.
This approach works, but it scales poorly to higher dimensions, and it’s still an approximation.
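For concreteness, here is roughly what the sampling approach looks like for the two-step cost in the running example: sample the outer variables, propagate the belief, then sample the inner ones. The constants and the crude grid search over $u_t$ are arbitrary choices, not a recommended solver.

```python
import numpy as np

rng = np.random.default_rng(0)
a, noise_var = 1.0, 0.1

def cautious_u(x, b_mean, b_var):
    return -a * b_mean * x / (b_mean ** 2 + b_var)

def posterior_update(b_mean, b_var, x_t, u_t, x_next):
    y = x_next - a * x_t
    denom = noise_var + b_var * u_t ** 2
    return (noise_var * b_mean + b_var * u_t * y) / denom, b_var * noise_var / denom

def two_step_cost(u_t, x_t, b_mean, b_var, n_outer=500, n_inner=64):
    """Monte Carlo estimate of E[x_{t+1}^2 + E[x_{t+2}^2 | x_{t+1}]] for a candidate u_t."""
    total = 0.0
    for _ in range(n_outer):
        # Outer expectation: sample b from the current belief, plus one noise draw.
        b = rng.normal(b_mean, np.sqrt(b_var))
        x1 = a * x_t + b * u_t + rng.normal(0.0, np.sqrt(noise_var))
        # Update the belief with the simulated observation, then apply the myopic control.
        m1, v1 = posterior_update(b_mean, b_var, x_t, u_t, x1)
        u1 = cautious_u(x1, m1, v1)
        # Inner expectation: sample from the updated belief and fresh process noise.
        b_in = rng.normal(m1, np.sqrt(v1), size=n_inner)
        x2 = a * x1 + b_in * u1 + rng.normal(0.0, np.sqrt(noise_var), size=n_inner)
        total += x1 ** 2 + np.mean(x2 ** 2)
    return total / n_outer

# Crude grid search over candidate controls. The dual effect is baked in: u_t changes
# how informative x_{t+1} is about b, which in turn changes the inner expectation.
candidates = np.linspace(-3.0, 3.0, 31)
costs = [two_step_cost(u, x_t=1.0, b_mean=1.0, b_var=4.0) for u in candidates]
print("best u_t ~", candidates[int(np.argmin(costs))])
```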
Why Dual Control is Hard
Dual control is hard because, even in simple dynamical systems, the belief dynamics are nonlinear (courtesy of Bayes’ rule), resulting in intractable nested expectations when reasoning about expected belief updates over horizons longer than a single step ($T > 1$).
I just described dual control in the context of a dynamical system where uncertainty is represented as a Gaussian belief over an unknown parameter in a linear system.
In this case the transition function is $x_{t+1} = a x_t + b u_t + \epsilon_t$.
However, as I explained in my previous post, Gaussian Processes and Meta-Learning Priors, I’m interested in leveraging more expressive representations of dynamical systems, such as neural networks trained to imitate lots of example trajectories, i.e., the transition function is $x_{t+1} = f_\theta(x_t, u_t)$, where $f_\theta$ is a deep neural network parameterized by weights and biases $\theta$.
Let’s talk about how to represent uncertainty when the transition function is a neural network.
In the linear dynamics model we have uncertainty over the process noise $\epsilon_t$ and the model parameter $b$.
Practically speaking, this means that when minimizing a cost we must compute its joint expectation over $\epsilon_t$ and $b$, i.e., $\mathbb{E}_{b, \epsilon_t}\left[c(x_t, u_t)\right]$.
On the other hand, when using a neural network to represent dynamics, there are many ways to reason about the uncertainty of the next step (e.g., neural network ensembles, Monte Carlo dropout, conformal prediction, etc.), and also about the model parameters (e.g., Bayesian neural networks).
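To make the first family concrete before getting to the Bayesian options, here is a toy bootstrap-ensemble sketch for next-state uncertainty. The fixed nonlinear features stand in for a learned network body, and the data, features, and sizes are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy transition data from an unknown scalar system: columns are (x_t, u_t).
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = X[:, 0] + 0.5 * np.sin(X[:, 1]) + 0.05 * rng.normal(size=200)   # x_{t+1}

def features(X):
    # Fixed nonlinear features standing in for a learned network body.
    x, u = X[:, 0], X[:, 1]
    return np.stack([x, u, np.sin(u), x * u, np.ones_like(x)], axis=1)

# Bootstrap ensemble: fit each member on a different resampling of the data.
ensemble = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))
    w, *_ = np.linalg.lstsq(features(X[idx]), y[idx], rcond=None)
    ensemble.append(w)

# At prediction time, disagreement across members is the uncertainty estimate.
query = np.array([[0.3, 0.8]])
preds = np.array([features(query) @ w for w in ensemble]).ravel()
print(f"next state: {preds.mean():.3f} +/- {preds.std():.3f}")
```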
Bayesian neural networks are probably the closest to what we’re trying to do since they define a posterior distribution over model parameters, i.e., instead of a point estimate for the model weights, we get a distribution conditioned on the training data.
Unfortunately, as far as I understand, Bayesian neural networks are hard to scale, and adding dual control, which, as I illustrated here, is already very hard, is unlikely to make the problem easier.
Therefore, if I want to combine dual control with expressive dynamics models such as neural networks, I’m going to have to reason about how to get the best of both worlds: expressiveness and tractability of weight updates.
ALPaCA kinda does this: it’s like a Bayesian neural network with uncertainty only over the last, linear layer.
It also uses a recursive Bayesian update based on least squares that’s supposed to be more efficient.
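To make that concrete, here is a minimal sketch of the kind of Bayesian last-layer regression ALPaCA builds on. The class, constants, and toy training loop are mine, not the paper’s code, and the feature vectors here would really come from the frozen network body:

```python
import numpy as np

class BayesianLastLayer:
    """Bayesian linear regression over the features of a frozen network body."""

    def __init__(self, feature_dim, prior_var=10.0, noise_var=0.1):
        self.mean = np.zeros(feature_dim)                 # posterior mean of the last-layer weights
        self.precision = np.eye(feature_dim) / prior_var  # posterior precision (inverse covariance)
        self.noise_var = noise_var

    def update(self, phi, y):
        """Recursive Bayesian update after observing target y with features phi."""
        new_precision = self.precision + np.outer(phi, phi) / self.noise_var
        rhs = self.precision @ self.mean + phi * y / self.noise_var
        self.mean = np.linalg.solve(new_precision, rhs)
        self.precision = new_precision

    def predict(self, phi):
        """Predictive mean and variance for a new feature vector."""
        cov = np.linalg.inv(self.precision)
        return phi @ self.mean, phi @ cov @ phi + self.noise_var

# Toy usage: random vectors stand in for features produced by the network body.
rng = np.random.default_rng(0)
layer = BayesianLastLayer(feature_dim=3)
true_w = np.array([0.8, -0.2, 0.1])
for _ in range(50):
    phi = rng.normal(size=3)
    y = true_w @ phi + rng.normal(0.0, np.sqrt(layer.noise_var))
    layer.update(phi, y)

mean, var = layer.predict(np.array([1.0, 0.0, 0.0]))
print(f"prediction: {mean:.3f} +/- {np.sqrt(var):.3f}")
```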
Now the question is: how does such an update translate to a dual control setting?
Are we subject to the same pitfalls?
Another direction I’ll be exploring is the connection between the expected weight updates in belief-space planning (see this paper by Platt et al.) and those in ALPaCA.
David had a hunch that there might be an equivalence, or at the very least an interesting connection, between the ALPaCA weight updates and the EKF parameter updates in Platt’s paper.
¹ You’ll notice that this cost looks different from the one I defined before. This is because we’re driving the state to zero in a single step, so the expression simplifies to $\mathbb{E}\left[x_{t+1}^2\right]$. ↩
² It turns out that $g$ can be written as a rational function $\frac{p(\cdot)}{q(\cdot)}$, where $p$ and $q$ are polynomials. This means that the expected value will most likely not have a closed form solution. I know there are cases where it is possible to compute the expectation of such rational functions, but as far as I’m aware, the results are limited to very specific polynomials. Here’s an example. ↩