A note on goal-conditioned RL using contrastive learning, where value is defined by representational similarity to a goal state.


In goal-conditioned RL with contrastive learning, the critic takes three arguments: a current state $s$, an action $a$, and a query state $s_q$. Internally, the network is split into two projection heads: a state-action encoder $\phi(s, a)$ and a query-state encoder $\psi(s_q)$.
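As a concrete reference, here is a minimal sketch of such a two-headed critic in PyTorch; the class name, layer sizes, and representation dimension are illustrative assumptions rather than a fixed recipe.

```python
import torch
import torch.nn as nn


class ContrastiveCritic(nn.Module):
    """Two-headed critic: phi(s, a) for state-action pairs, psi(s_q) for query states."""

    def __init__(self, state_dim: int, action_dim: int, repr_dim: int = 64, hidden: int = 256):
        super().__init__()
        # State-action encoder phi(s, a)
        self.sa_encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, repr_dim),
        )
        # Query-state encoder psi(s_q)
        self.q_encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, repr_dim),
        )

    def forward(self, state, action, query_state):
        # Similarity between the state-action embedding and the query-state embedding
        phi = self.sa_encoder(torch.cat([state, action], dim=-1))
        psi = self.q_encoder(query_state)
        return (phi * psi).sum(dim=-1)  # dot-product critic, one scalar per batch element
```

The forward pass returns the raw dot product; the loss and actor updates below also access the two heads separately through `sa_encoder` and `q_encoder`.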

We train this critic using a contrastive loss that pulls representations of actual transitions together and pushes random transitions apart. Assume we sample a valid transition tuple $(s, a, s')$ and a random negative state $s^-$ from our buffer. The objective is to maximize the similarity between the state-action pair $(s, a)$ and the true next state $s'$, while minimizing its similarity to the random state $s^-$. Formally, the loss is given by:

$$\mathcal{L}(s, a, s', s^-) = -\log \frac{\exp\left(\phi(s, a)^\top \psi(s')\right)}{\exp\left(\phi(s, a)^\top \psi(s')\right) + \exp\left(\phi(s, a)^\top \psi(s^-)\right)}$$
In practice, we use cosine similarity (normalizing the vectors) rather than raw dot products to prevent magnitude-related instabilities.
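Below is a sketch of this training step under the single-negative formulation above, with the normalization applied before computing similarities; the function name and batching conventions are my own assumptions.

```python
import torch
import torch.nn.functional as F


def contrastive_critic_loss(critic, s, a, s_next, s_neg):
    """Single-negative contrastive loss: pull (s, a) toward its true next state,
    push it away from a random state sampled from the buffer."""
    phi = critic.sa_encoder(torch.cat([s, a], dim=-1))
    psi_pos = critic.q_encoder(s_next)
    psi_neg = critic.q_encoder(s_neg)

    # Normalize so the similarity is cosine similarity rather than a raw dot product.
    phi = F.normalize(phi, dim=-1)
    psi_pos = F.normalize(psi_pos, dim=-1)
    psi_neg = F.normalize(psi_neg, dim=-1)

    sim_pos = (phi * psi_pos).sum(dim=-1)  # similarity to the actual next state
    sim_neg = (phi * psi_neg).sum(dim=-1)  # similarity to the random negative

    # Softmax over {positive, negative}: maximize sim_pos relative to sim_neg.
    logits = torch.stack([sim_pos, sim_neg], dim=-1)
    labels = torch.zeros(s.shape[0], dtype=torch.long, device=s.device)  # positive is index 0
    return F.cross_entropy(logits, labels)
```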

During the actor training phase, we repurpose this learned critic to estimate the value of actions relative to a specific goal $g$. We define the value as the proximity of the current state-action representation to the goal representation:

$$Q(s, a, g) = \phi(s, a)^\top \psi(g)$$
High similarity implies the actor is taking an action that transitions the system into a state representationally close to the goal.
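Here is a sketch of how the actor update could use this quantity, assuming a goal-conditioned `actor(s, g)` network whose action output is differentiable (e.g. a reparameterized sample); those pieces are not specified in the note above.

```python
import torch
import torch.nn.functional as F


def actor_loss(critic, actor, s, g):
    """Actor maximizes the similarity between phi(s, pi(s, g)) and psi(g)."""
    a = actor(s, g)  # goal-conditioned policy proposes an action (assumed differentiable)
    phi = F.normalize(critic.sa_encoder(torch.cat([s, a], dim=-1)), dim=-1)
    psi_g = F.normalize(critic.q_encoder(g), dim=-1)
    value = (phi * psi_g).sum(dim=-1)  # Q(s, a, g): proximity of the state-action embedding to the goal embedding
    return -value.mean()               # gradient descent on the negative = ascent on the value
```

Minimizing this loss performs gradient ascent on the critic's similarity score, pushing the policy toward actions whose embeddings land near the goal embedding.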


Blogpost 19/100