I go through the difference between on-policy and off-policy algorithms in reinforcement learning.

TODO: CITE STUFF


Notation: At time $t$ we define the state as $s_t$, the action as $a_t$, and the reward as $r_t$. A policy $\pi$ is a map from states to actions, i.e., $a_t = \pi(s_t)$. We refer to the function $Q^\pi(s, a)$ as the expected total discounted reward an agent will receive if it starts at state $s$, takes action $a$, and then follows the policy $\pi$ forever after.

On-Policy vs. Off-Policy

The difference between on-policy and off-policy algorithms lies in what data we use to train the agent’s policy (referred to as the target policy).

In an on-policy approach, we train the target policy using only data collected by the current policy $\pi$. If this policy changes (through a gradient update, for example), we discard any data collected with the old policy. In contrast, an off-policy approach can learn from data collected by a different policy (often called a behavior policy). This allows us to reuse old data stored in a replay buffer, making learning more sample efficient.
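
As a rough sketch of the idea, a replay buffer can be little more than a fixed-size queue of transitions that we sample mini-batches from. The class and method names below are my own, purely for illustration, not from any particular library:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay buffer sketch: a fixed-size queue of transitions."""

    def __init__(self, capacity=100_000):
        # Oldest transitions are dropped automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly sample a mini-batch of past transitions, possibly collected
        # by old versions of the policy (this reuse is what "off-policy" buys us).
        return random.sample(self.buffer, batch_size)
```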

Before showing specific examples of on- and off-policy approaches, I’ll outline the connection between the $Q$-function and the policy $\pi$. The policy, $\pi(s)$, dictates which action to take in state $s$. A policy can be derived directly from a $Q$-function. For instance, the greedy policy with respect to $Q$ is the policy that always chooses the action with the highest $Q$-value:

$$
\pi(s) = \arg\max_a Q(s, a)
$$

This is the “best” policy one can follow if the $Q$-function is accurate. During learning, algorithms often use an $\epsilon$-greedy policy to balance exploration (trying random actions) and exploitation (taking the best-known action):

$$
\pi(s) =
\begin{cases}
\text{a random action} & \text{with probability } \epsilon \\
\arg\max_a Q(s, a) & \text{with probability } 1 - \epsilon
\end{cases}
$$

In essence, the $Q$-function learns the value of actions, and the policy uses those values to select actions.
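
For concreteness, here is a minimal tabular sketch of both policies, assuming `Q` is a NumPy array of shape `(n_states, n_actions)`; the function names are my own, purely for illustration:

```python
import numpy as np

def greedy_action(Q, state):
    # Exploit: pick the action with the highest Q-value in this state.
    return int(np.argmax(Q[state]))

def epsilon_greedy_action(Q, state, epsilon=0.1, rng=None):
    # Explore with probability epsilon, otherwise act greedily.
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return greedy_action(Q, state)
```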

An example of an on-policy algorithm is SARSA, where we learn $Q$ by computing the temporal difference (TD): the difference between our current estimate of $Q(s_t, a_t)$ and the most up-to-date “observation” of $Q(s_t, a_t)$, known as the TD target. This quantity is calculated using the actual reward at the next step and is given by $r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1})$, where $\gamma$ is the discount factor. The full update is given by

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \tag{1}
$$

where $\alpha$ is the learning rate. The reason SARSA is on-policy is that the $Q$-function is updated using data collected with the current policy $\pi$. That is, in equation (1) the actions are given by $a_t = \pi(s_t)$ and $a_{t+1} = \pi(s_{t+1})$.
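
In tabular form, one SARSA update might look like the sketch below, where `Q` is again a NumPy array of shape `(n_states, n_actions)` and `a_next` is the action the current ($\epsilon$-greedy) policy actually chose in `s_next`; names and defaults are illustrative:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: the TD target uses the action a_next that the current
    # policy actually took in s_next.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```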

An example of an off-policy algorithm is Q-learning, for which the update is slightly different:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \tag{2}
$$

Note that this means that the update to $Q$ assumes a greedy policy for the rest of the trajectory (hence the $\max$ operator), and therefore learning can happen from data that’s collected by a different policy.
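
The corresponding tabular sketch only changes the TD target; no action taken in `s_next` is needed, which is exactly what lets the data come from any behavior policy:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: the TD target assumes the greedy action in s_next,
    # regardless of which action the behavior policy actually took.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```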

Pros and Cons

On-Policy (e.g., SARSA, A2C, PPO)

  • Pros:

    • Stability: Often more stable and easier to tune. Since the policy being improved is the same one generating data, the updates are consistent with the current behavior. TODO: CITE THIS
    • Direct Optimization: The algorithm directly optimizes the performance of the policy you are actually using.
  • Cons:

    • Sample Inefficiency: Every time the policy is updated, all the data collected with the old policy must be thrown away. This is because the $Q$-function estimates are only valid for the policy that generated them. This can be extremely slow if collecting data is expensive (e.g., in robotics).

Off-Policy (e.g., Q-Learning, DDPG, SAC)

  • Pros:

    • Sample Efficiency: Off-policy algorithms can learn from data collected at any time, regardless of what policy collected it. They typically store transitions in a large replay buffer and sample batches from it to update the policy.
    • Exploration: Can learn about the optimal policy while following a more exploratory (e.g., $\epsilon$-greedy or random) behavior policy.
  • Cons:

    • Higher Variance: Updates can have higher variance because the data comes from a different policy distribution than the one being learned. This mismatch (a form of distribution shift between the behavior and target policies) needs to be corrected, often using techniques like importance sampling (which PPO, despite being on-policy, uses a variant of); see the sketch after this list. TODO: CITE THIS.
    • Stability: Can be less stable and more sensitive to hyperparameter tuning compared to their on-policy counterparts, especially when combined with deep neural networks. TODO: CITE THIS.
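
To make the importance-sampling correction mentioned above a bit more concrete, here is a minimal sketch of a per-step importance weight; the policy probabilities are made-up numbers, purely for illustration:

```python
import numpy as np

def importance_weight(target_probs, behavior_probs, action):
    # Reweight a sample collected under the behavior policy so that, in
    # expectation, it looks as if it had been drawn from the target policy.
    return target_probs[action] / behavior_probs[action]

# Toy example: the behavior policy takes action 2 more often than the target
# policy would, so samples of action 2 get down-weighted.
target_probs = np.array([0.7, 0.2, 0.1])
behavior_probs = np.array([0.4, 0.3, 0.3])
print(importance_weight(target_probs, behavior_probs, action=2))  # ~0.33
```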