A note on goal-conditioned RL using contrastive learning, where value is defined by representational similarity to a goal state.
In goal-conditioned RL with contrastive learning, the critic takes three arguments: a current state, an action, and a goal state. It embeds the (state, action) pair and the goal separately, and scores them by the similarity of the two embeddings.
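As a minimal sketch of that architecture (PyTorch; the encoder sizes, layer widths, and names here are my own assumptions, not from any particular implementation):

```python
import torch
import torch.nn as nn

class ContrastiveCritic(nn.Module):
    """Scores f(s, a, g) as the similarity between a (state, action)
    embedding and a goal embedding."""

    def __init__(self, state_dim, action_dim, embed_dim=64):
        super().__init__()
        # phi: embeds the (state, action) pair
        self.sa_encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )
        # psi: embeds the goal state
        self.goal_encoder = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, state, action, goal):
        phi = self.sa_encoder(torch.cat([state, action], dim=-1))
        psi = self.goal_encoder(goal)
        # Raw dot-product similarity; a cosine variant appears below.
        return (phi * psi).sum(dim=-1)
```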
We train this critic with a contrastive loss that pulls the embedding of an actual transition's (state, action) pair toward the embedding of the future state it reaches, and pushes it away from the embeddings of states drawn from unrelated transitions.
Let’s assume we sample a valid trajectory tuple (s_t, a_t, s_f), where s_f is a future state actually reached after taking a_t in s_t. The future state plays the role of the positive goal; future states from other, randomly sampled tuples serve as negatives.
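A sketch of that loss, assuming the ContrastiveCritic above and the common in-batch-negatives trick, where every other row of the batch acts as a negative for row i:

```python
import torch
import torch.nn.functional as F

def critic_contrastive_loss(critic, states, actions, future_states):
    """InfoNCE-style loss: the i-th future state is the positive goal
    for the i-th (state, action) pair; all other rows are negatives."""
    phi = critic.sa_encoder(torch.cat([states, actions], dim=-1))  # (B, D)
    psi = critic.goal_encoder(future_states)                       # (B, D)
    logits = phi @ psi.T                      # (B, B) pairwise similarities
    labels = torch.arange(logits.shape[0], device=logits.device)
    # Diagonal entries correspond to actual transitions (positives);
    # off-diagonal entries correspond to random pairings (negatives).
    return F.cross_entropy(logits, labels)
```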
In practice, we use cosine similarity (normalizing the vectors) rather than raw dot products to prevent magnitude-related instabilities.
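The change is a one-liner; the temperature below is a hypothetical hyperparameter, included because normalized scores are bounded in [-1, 1] and usually need rescaling:

```python
import torch.nn.functional as F

def cosine_logits(phi, psi, temperature=0.1):
    # Project both embeddings onto the unit sphere so the score depends
    # only on direction, not magnitude, then rescale with a temperature.
    phi = F.normalize(phi, dim=-1)
    psi = F.normalize(psi, dim=-1)
    return (phi @ psi.T) / temperature
```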
During the actor training phase, we repurpose this learned critic to estimate the value of actions relative to a specific goal. High similarity means the actor is taking an action that moves the system into a state representationally close to the goal.
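A sketch of that actor update, assuming a hypothetical goal-conditioned policy network `policy(states, goals)` and the critic from above: the actor is trained by gradient ascent on the critic's similarity score.

```python
def actor_loss(critic, policy, states, goals):
    """Pick actions whose (state, action) embedding lands close to the
    goal embedding, i.e. maximize the critic's similarity score."""
    actions = policy(states, goals)          # hypothetical policy net
    scores = critic(states, actions, goals)  # similarity to the goal
    return -scores.mean()                    # minimize the negative score
```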
Blogpost 19/100