In this post, I go through the difference between on-policy and off-policy algorithms in reinforcement learning.
TODO: CITE STUFF
Notation:
At time $t$, the agent observes state $s_t$, takes action $a_t$ according to its policy $\pi$, receives reward $r_t$, and transitions to state $s_{t+1}$. The action-value function $Q(s, a)$ estimates the expected discounted return from taking action $a$ in state $s$; $\alpha$ denotes the learning rate and $\gamma$ the discount factor.
On-Policy vs. Off-Policy
The difference between on-policy and off-policy algorithms is what data we use to train the agent's policy (referred to as the target policy).
In an on-policy approach, we train the target policy using only data collected by the current policy itself. In an off-policy approach, the training data can come from a different policy (the behavior policy), such as an older version of the target policy or a more exploratory one.
Before showing specific examples of on- and off-policy approaches, I'll outline the connection between the $Q$-function and the policy. Given an estimate of $Q(s, a)$, we can act greedily with respect to it, $\pi(s) = \arg\max_a Q(s, a)$. This is the “best” policy one can follow if the $Q$-estimates are accurate. In essence, the $Q$-function implicitly defines a policy, so learning a good $Q$-function and learning a good policy are two views of the same problem.
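As a concrete illustration, here is a minimal sketch in Python of extracting the greedy action from a tabular $Q$-function; the dictionary-based `Q` and the fixed `actions` set are my own assumptions, not part of any particular implementation:

```python
# Minimal sketch: acting greedily with respect to a tabular Q-function.
# `Q` is assumed to be a dict mapping (state, action) -> estimated return,
# and `actions` a finite action set; both are hypothetical here.

def greedy_action(Q, state, actions):
    """Return the action with the highest estimated value in `state`."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```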
An example of an on-policy algorithm is SARSA, where we learn the $Q$-function with the update

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right],$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $a_{t+1}$ is the action the current policy actually takes in state $s_{t+1}$. Because the target depends on the action chosen by the policy being trained, the transitions must come from that same policy, which is what makes SARSA on-policy.
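A minimal sketch of this update in Python, assuming the same tabular `Q` dictionary as above and hypothetical hyperparameters `alpha` and `gamma`:

```python
# Minimal sketch of the tabular SARSA update (on-policy).
# `Q` maps (state, action) -> value; alpha and gamma are assumed constants.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """Update Q(s, a) toward r + gamma * Q(s_next, a_next), where a_next
    is the action the *current policy* actually took in s_next."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
```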
An example of an off-policy algorithm is Q-learning, for which the update is slightly different:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right].$$

Note that this means that the update to $Q(s_t, a_t)$ does not depend on which action the behavior policy actually takes in $s_{t+1}$: the target always bootstraps from the greedy (maximizing) action, so the transitions can come from any policy.
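And a corresponding sketch of the tabular Q-learning update, under the same assumptions as the SARSA snippet above:

```python
# Minimal sketch of the tabular Q-learning update (off-policy).
# `Q` maps (state, action) -> value; `actions` is the finite action set.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Update Q(s, a) toward r + gamma * max_a' Q(s_next, a'),
    regardless of which action the behavior policy takes next."""
    best_next = max(Q.get((s_next, a_next), 0.0) for a_next in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
```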
Pros and Cons
On-Policy (e.g., SARSA, A2C, PPO)
Pros:
- Stability: Often more stable and easier to tune. Since the policy being improved is the same one generating data, the updates are consistent with the current behavior. TODO: CITE THIS
- Direct Optimization: The algorithm directly optimizes the performance of the policy you are actually using.
Cons:
- Sample Inefficiency: Every time the policy $\pi$ is updated, all the data collected with the old policy must be thrown away. This is because the $Q$-function (or value) estimates are only valid for the policy that generated them. This can be extremely slow if collecting data is expensive (e.g., in robotics).
Off-Policy (e.g., Q-Learning, DDPG, SAC)
Pros:
- Sample Efficiency: Off-policy algorithms can learn from data collected at any time, regardless of what policy collected it. They typically store transitions in a large replay buffer and sample batches from it to update the policy (see the sketch after this list).
- Exploration: Can learn about the optimal policy while following a more exploratory (e.g., $\epsilon$-greedy or random) behavior policy.
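As a rough illustration of the replay-buffer idea mentioned above, here is a minimal sketch in Python; the class name, capacity, and batch size are my own choices, not taken from any particular library:

```python
# Minimal sketch of a replay buffer for off-policy learning.
# Transitions from *any* past policy are stored and sampled uniformly.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        """Sample a random batch of stored transitions for an update."""
        return random.sample(self.buffer, batch_size)
```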
Cons:
- Higher Variance: Updates can have higher variance because the data comes from a different policy distribution than the one being learned. This mismatch (known as “off-policy shift”) needs to be corrected, often using techniques like importance sampling (which PPO, despite being on-policy, uses a variant of); see the sketch below. TODO: CITE THIS.
- Stability: Can be less stable and more sensitive to hyperparameter tuning compared to their on-policy counterparts, especially when combined with deep neural networks. TODO: CITE THIS.
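To make the importance-sampling correction concrete, here is a minimal sketch of the per-sample importance weights; the function and argument names are illustrative, not from any specific algorithm's implementation:

```python
# Minimal sketch of importance sampling for off-policy corrections.
# `pi_target(a, s)` and `pi_behavior(a, s)` are assumed to return the
# probability of action `a` in state `s` under each policy (hypothetical).

def importance_weights(transitions, pi_target, pi_behavior):
    """Per-sample ratios pi_target / pi_behavior, used to reweight returns
    or advantages estimated from behavior-policy data. PPO clips a closely
    related ratio (new policy / old policy) to keep updates small."""
    return [pi_target(a, s) / pi_behavior(a, s) for (s, a) in transitions]
```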