I derive the vanilla policy gradient and then estimate it with the generalized advantage estimator, which can parametrically trade off bias against variance.


Notation: At time $t$ we define the state as $s_t$, the action as $a_t$, and the reward as $r_t$. A policy defines the agent’s behavior. We are interested in a stochastic policy parameterized by $\theta$, denoted $\pi_\theta(a_t \mid s_t)$, which specifies the probability of taking action $a_t$ in state $s_t$. We refer to the function $Q^\pi(s_t, a_t)$ as the expected total, discounted reward an agent will receive if it starts at state $s_t$, takes action $a_t$, and then follows the policy $\pi$ forever after.

Policy Gradient

Derivation

Let a trajectory $\tau$ of horizon $T$ be a sequence of states and actions, i.e.,

$$\tau = (s_0, a_0, s_1, a_1, \ldots, s_{T-1}, a_{T-1}, s_T).$$

Given a reward function $r(s, a)$, let the reward at time $t$ be defined as

$$r_t \equiv r(s_t, a_t).$$

The finite-horizon, undiscounted return is given by

$$R(\tau) = \sum_{t=0}^{T-1} r_t.$$

Now, consider the problem of finding a stochastic policy $\pi_\theta$ that maximizes the expected return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]$. This can be framed as the following optimization problem:

$$\max_\theta \; J(\theta),$$

which we could solve through gradient ascent:

$$\theta_{k+1} = \theta_k + \alpha\, \nabla_\theta J(\theta)\big|_{\theta = \theta_k},$$

where $\alpha$ is the learning rate.

Our next task, then, is to find an expression for the policy gradient $\nabla_\theta J(\theta)$. First, we write $J(\theta)$ in terms of the expectation

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]. \tag{1}$$

The probability of a trajectory, conditioned on parameters $\theta$, is given by

$$P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t),$$

where $\rho_0(s_0)$ is the initial state distribution and $P(s_{t+1} \mid s_t, a_t)$ denotes the stochastic dynamics. Then, the expectation in (1) can be written as an integral over trajectories, giving the policy gradient

$$\nabla_\theta J(\theta) = \nabla_\theta \int P(\tau \mid \theta)\, R(\tau)\, d\tau = \int \nabla_\theta P(\tau \mid \theta)\, R(\tau)\, d\tau. \tag{2}$$

Evaluating $\nabla_\theta P(\tau \mid \theta)$ might be really hard (or impossible) because

  1. The initial state distribution $\rho_0$ and the dynamics $P(s_{t+1} \mid s_t, a_t)$ may be unknown.
  2. Even if they are known, they might be too complex to differentiate through.¹

To get around this, we can re-write it in terms of easier-to-evaluate log-probabilities. First we note that $\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)$ (the log-derivative trick). Then, we can write:

$$\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \Big[ \log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t) \Big]$$

And if we take the gradient, re-arrange, substitute, and simplify (the $\rho_0$ and dynamics terms vanish because they do not depend on $\theta$):

$$\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta) = P(\tau \mid \theta) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Plugging this back into (2):

$$\nabla_\theta J(\theta) = \int P(\tau \mid \theta) \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] R(\tau)\, d\tau$$

And re-writing in terms of an expectation:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]$$

Estimation

In practice, we compute $\nabla_\theta J(\theta)$ by rolling out trajectories and computing a Monte Carlo estimate of the expectation. That is, if we have $N$ trajectories $\tau^{(1)}, \ldots, \tau^{(N)}$, the Monte Carlo estimate is given by

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) R\!\left(\tau^{(i)}\right)$$
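
To make this concrete, here is a minimal sketch of this estimator for a hypothetical linear-softmax policy, written in plain NumPy. The function names, the `(states, actions, rewards)` rollout format, and the policy parameterization are illustrative choices of mine, not from any particular library.

```python
import numpy as np

def grad_log_policy(theta, s, a):
    """Gradient of log pi_theta(a | s) for a linear-softmax policy.

    theta: (n_actions, n_features) parameter matrix
    s:     (n_features,) state feature vector
    a:     index of the action taken
    """
    logits = theta @ s
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = -np.outer(probs, s)   # gradient of the (negative) log-normalizer
    grad[a] += s                 # gradient of the chosen action's logit
    return grad

def policy_gradient_estimate(theta, trajectories):
    """Monte Carlo estimate: (1/N) sum_i sum_t grad log pi(a_t|s_t) * R(tau_i).

    trajectories: list of (states, actions, rewards) rollouts.
    """
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        R = float(sum(rewards))                     # undiscounted return R(tau)
        for s, a in zip(states, actions):
            grad += grad_log_policy(theta, np.asarray(s, dtype=float), a) * R
    return grad / len(trajectories)
```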

Unfortunately, this is a very high-variance estimate (trust me on this, for now). In the next section, we introduce a variance reduction technique where we replace $R(\tau)$ with an advantage function that measures improvement over a baseline instead of using raw rewards.

The Generalized Advantage Estimator

In this section, I’ll start by deriving the advantage function. Then, I’ll show how using this function (instead of the raw reward) provides a lower-variance estimator of $\nabla_\theta J(\theta)$. Finally, I’ll derive the generalized advantage estimator, which lowers the variance even further.

Bias of a baselined policy gradient estimator

A baseline $b(s_t)$ is a function that’s subtracted from the return $R(\tau)$, implying that the policy gradient is given by

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big( R(\tau) - b(s_t) \Big) \right]$$

It turns out that we can introduce any baseline without biasing the policy gradient expectation, as long as the baseline does not depend on $\theta$ or $a_t$. To show this, we first note that, due to the linearity of expectation, we can write the above expression in two terms:

$$\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right] - \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t) \right]$$

Our task is to show that the second term is equal to zero.

First, let $\pi_t \equiv \pi_\theta(a_t \mid s_t)$ and $P_t \equiv P(s_{t+1} \mid s_t, a_t)$. Then, by applying the linearity of expectation again, the right term is given by

$$\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_t\, b(s_t) \right] = \sum_{t=0}^{T-1} \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_t\, b(s_t) \right]$$

Now let’s look at the expectation for a single timestep $t$:

$$\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_t\, b(s_t) \right] = \int P(\tau \mid \theta)\, \nabla_\theta \log \pi_t\, b(s_t)\, d\tau$$

If we substitute $P(\tau \mid \theta)$ from the first section and rearrange, we have

$$\int \rho_0(s_0) \left[ \prod_{t'=0}^{T-1} P_{t'}\, \pi_{t'} \right] \nabla_\theta \log \pi_t\, b(s_t)\, d\tau$$

If we re-arrange the integrals over actions, we can express this as nested integrals:

$$\int_{s_0}\rho_0(s_0)\int_{a_0}\pi_0\int_{s_1}P_0\cdots\int_{s_t}P_{t-1}\; b(s_t)\int_{a_t}\pi_t\,\nabla_\theta \log \pi_t \left[\int_{s_{t+1}}P_t\int_{a_{t+1}}\pi_{t+1}\cdots\right] da_t\, ds_t \cdots da_0\, ds_0$$

In this form, it is clear that most of the nested integrals simplify to 1: every integral over timesteps after $t$ integrates a probability distribution over its full support. After collapsing those integrals we’re left with

$$\int_{s_0}\rho_0(s_0)\int_{a_0}\pi_0\cdots\int_{s_t}P_{t-1}\; b(s_t)\underbrace{\int_{a_t}\pi_\theta(a_t \mid s_t)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, da_t}_{=\,0}\; ds_t \cdots da_0\, ds_0$$

where the underbraced term going to zero is a result of applying the log trick shown in the first section (this is also a known property of score functions):

$$\int_{a_t} \pi_\theta(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, da_t = \int_{a_t} \nabla_\theta\, \pi_\theta(a_t \mid s_t)\, da_t = \nabla_\theta \int_{a_t} \pi_\theta(a_t \mid s_t)\, da_t = \nabla_\theta 1 = 0.$$

Therefore,

$$\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_t\, b(s_t) \right] = 0 \quad \text{for every } t,$$

implying that the expectation of the baselined return does not depend on $b$, i.e.,

$$\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big( R(\tau) - b(s_t) \Big) \right] = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right].$$

Now that we’ve shown that adding a baseline $b(s_t)$ does not introduce bias, we’re ready to explore the exact form of $b(s_t)$.

The advantage function

It turns out that there are several related expressions for the policy gradient (see Schulman et al.’s paper for more details). The estimator I derived above used the return, but there are many other valid expressions, all of which have the form

$$\nabla_\theta J(\theta) = \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \Psi_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right],$$

where $\Psi_t$ can be, among other choices, the total return of the trajectory, the return following action $a_t$, the state-action value function $Q^\pi(s_t, a_t)$, or the advantage function $A^\pi(s_t, a_t)$.

(Figure: different forms of the policy gradient, from Schulman et al.)

As discussed in Schulman’s paper, it turns out that using the advantage function yields an estimator of the policy gradient with almost the lowest possible variance. Simply stated, the advantage function measures the following: “If I take action $a_t$ at state $s_t$, what reward will I get compared to the expected reward of being in $s_t$ and following $\pi$?” Concretely, given a stochastic policy $\pi$, the expected reward of $s_t$ is given by the expectation of $Q^\pi(s_t, a_t)$ over the policy, i.e.,

$$V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi}\!\left[ Q^\pi(s_t, a_t) \right].$$

Then, the advantage of action $a_t$ at state $s_t$ is given by

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t).$$

The problem with using the advantage function is that, in practice, it is unknown (we don’t know what the value function is). One way of approximating the advantage is to use TD residuals. The TD residual with discount $\gamma$ is defined as

$$\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t).$$

It turns out that if $V = V^\pi$ is the correct value function, this TD residual is an unbiased estimator of the advantage $A^\pi(s_t, a_t)$ (see equation 10 in Schulman’s paper). We denote this approximation as

$$\hat{A}_t^{(1)} := \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t).$$
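
As a quick sketch, assuming we have the per-step rewards of a rollout and value estimates for every visited state (including a bootstrap value for the final state), the TD residuals can be computed in one vectorized expression. The function name and array layout are illustrative choices of mine:

```python
import numpy as np

def td_residuals(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for every step of one rollout.

    rewards: shape (T,)   observed rewards r_0 ... r_{T-1}
    values:  shape (T+1,) value estimates V(s_0) ... V(s_T); the last entry
             bootstraps the value of the state reached after the rollout
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]
```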

The problem is that if all we have is an approximation of the value function (e.g., a neural network), then $\hat{A}_t^{(1)}$ is not unbiased. We can remedy this by using the observed rewards for more steps, essentially holding off on applying the value function for longer. We do so by unrolling the value function one step further, i.e.,

$$\hat{A}_t^{(2)} := r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t).$$

If we manipulate it a little, we can re-write this as a discounted sum of two TD residuals:

$$\hat{A}_t^{(2)} = \delta_t^V + \gamma\, \delta_{t+1}^V.$$

Therefore, we can define a whole family of $k$-step estimators for the advantage function:

$$\hat{A}_t^{(k)} := \sum_{l=0}^{k-1} \gamma^l\, \delta_{t+l}^V = r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}) - V(s_t).$$
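
As a small sketch building on the hypothetical `td_residuals` helper above, the $k$-step estimator is just a truncated, discounted sum of TD residuals:

```python
def k_step_advantage(deltas, gamma, t, k):
    """A_t^(k) = sum_{l=0}^{k-1} gamma**l * delta_{t+l}.

    deltas: array of TD residuals for one rollout (e.g., from td_residuals above).
    The sum is truncated if fewer than k residuals remain after timestep t.
    """
    horizon = min(k, len(deltas) - t)
    return sum(gamma**l * deltas[t + l] for l in range(horizon))
```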

It turns out that as $k \to \infty$, the term $\gamma^k V(s_{t+k})$ approaches zero. This is great, because this is exactly the term which adds bias to the advantage estimate ($V$ is a biased approximation of the true value function). Therefore, as $k \to \infty$, the $k$-step advantage estimator becomes unbiased!

Unfortunately, there is no free lunch: our estimator is now essentially the empirical return (minus a baseline), which, unbiased as it may be, has much higher variance. The empirical return has high variance because it sums rewards over an entire trajectory, so the randomness from every single action and state transition compounds, causing the total to vary significantly from one rollout to the next. In the next section I go into one way of dealing with this.

The generalized advantage estimator

The generalized advantage estimator is simply an exponentially weighted average of the $k$-step estimators, i.e.,

$$\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} := (1 - \lambda)\left( \hat{A}_t^{(1)} + \lambda\, \hat{A}_t^{(2)} + \lambda^2\, \hat{A}_t^{(3)} + \cdots \right) = \sum_{l=0}^{\infty} (\gamma \lambda)^l\, \delta_{t+l}^V.$$

This weighted average provides a way of compromising between bias and variance. Consider the following two extremes:

  • The 1-step estimator ($\lambda = 0$): This is $\hat{A}_t^{(1)} = \delta_t^V$. It is stable but biased, as it relies heavily on our approximate value function $V$.
  • The $\infty$-step estimator ($\lambda = 1$): This is $\hat{A}_t^{(\infty)} = \sum_{l=0}^{\infty} \gamma^l\, r_{t+l} - V(s_t)$, the discounted sum of all future rewards (minus the value baseline). It is unbiased but noisy, as the sum of many random rewards has high variance.

In the GAE, the parameter $\lambda \in [0, 1]$ acts as a control knob for this blend. For an in-depth analysis, take a look at Section 3 in Schulman’s paper.
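
To make this concrete, here is a minimal sketch of how a truncated GAE can be computed for a single finite rollout, using the recursion $\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}$. The function name, the default hyperparameters, and the assumption that `values` carries a bootstrap estimate for the final state are illustrative choices of mine, not prescribed by the paper.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Truncated GAE for one rollout via A_t = delta_t + gamma * lam * A_{t+1}.

    rewards: shape (T,)   observed rewards r_0 ... r_{T-1}
    values:  shape (T+1,) value estimates V(s_0) ... V(s_T); the last entry
             bootstraps the value of the state reached after the rollout
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD residuals

    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):                # sweep backwards in time
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` recovers the one-step TD residual, while `lam=1` recovers the discounted empirical return minus the value baseline, matching the two extremes above.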

Thank you for reading! In the next section, I’ll go into how the GAE is used in a popular RL algorithm: PPO.


Resources

Policy gradient

Footnotes

  1. Incidentally, differentiating through the transition function is exactly what I did in this post. The caveat is that I had access to a very nice differentiable simulator.