Thanks for writing this! I have a few questions, to make sure I’m understanding the architecture correctly:
So, essentially, the above takes a transformation of the current state, feeds it through an action-conditioned dynamics network, and pulls the result close to a transformation of the future state.
Is this transformation an image augmentation, like you mention regarding SimSiam? Is the transformation the same for both states? And is the dynamics network also trained in this step?
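For what it's worth, here's the picture I have in mind, written as a rough PyTorch-style sketch; the module names (`encoder`, `dynamics`, `projector`, `predictor`) and the stop-gradient on the target branch are my own guesses based on the SimSiam comparison, not something I'm claiming matches your setup:

```python
import torch
import torch.nn.functional as F

def consistency_loss(encoder, dynamics, projector, predictor,
                     obs_t, action_t, obs_tp1):
    """SimSiam-style consistency loss, as I understand the description.

    The current state's representation is rolled forward through the
    action-conditioned dynamics network, then pulled toward the (detached)
    representation of the actual next state.
    """
    # Online branch: encode s_t and predict s_{t+1}'s latent via the dynamics net.
    z_t = encoder(obs_t)
    z_pred = dynamics(z_t, action_t)      # action-conditioned rollout
    p = predictor(projector(z_pred))      # SimSiam-style predictor head

    # Target branch: encode the real s_{t+1}; stop-gradient as in SimSiam.
    with torch.no_grad():
        z_target = projector(encoder(obs_tp1))

    # Negative cosine similarity pulls the two views together.
    return -F.cosine_similarity(p, z_target, dim=-1).mean()
```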
Is the representation network trained in the same operation that trains the state-value and action-value functions? If not, what is to stop the representation function from being extremely similar regardless of the actual state?
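To make the second question concrete, this is the kind of joint update I'm imagining, where the encoder gets its gradients from the same TD loss that trains the value heads (the module names and the discrete-action Q-head are my assumptions). If the encoder were excluded from this step and trained only by the consistency loss, that's the case where I don't see what prevents near-identical outputs across states:

```python
import torch
import torch.nn.functional as F

def joint_value_update(encoder, value_head, q_head, optimizer,
                       obs, action, reward, next_obs, done, gamma=0.99):
    """One gradient step in which the encoder is trained by the same TD
    losses that train the state-value and action-value heads. This is my
    guess at what "trained in the same operation" would look like."""
    z = encoder(obs)
    v = value_head(z).squeeze(-1)                              # V(s)
    q = q_head(z).gather(1, action.unsqueeze(-1)).squeeze(-1)  # Q(s, a)

    # Bootstrapped TD target; no gradient through the target branch.
    with torch.no_grad():
        next_v = value_head(encoder(next_obs)).squeeze(-1)
        target = reward + gamma * (1.0 - done) * next_v

    loss = F.mse_loss(v, target) + F.mse_loss(q, target)

    optimizer.zero_grad()
    loss.backward()   # gradients reach the encoder as well as both heads
    optimizer.step()
    return loss.item()
```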
The third technique, the one where older episodes use fewer timesteps of actual reward before the state-value function truncates the return, makes me wonder if there has been any research where the final policy “grades” the quality of previous actions. It seems like that could shed some light on why this sort of technique works.
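And to be explicit about how I'm reading that third technique: something like the sketch below, where the number of real-reward steps before bootstrapping with the state-value function shrinks as the episode ages in the replay buffer (the shrink schedule here is purely made up):

```python
def n_step_target(rewards, values, age, gamma=0.99, max_steps=5):
    """Bootstrapped return where older data uses fewer real-reward steps
    before the state-value function takes over.

    rewards : rewards r_t, r_{t+1}, ... from the stored episode
    values  : value estimates V(s_t), V(s_{t+1}), ... (aligned with states)
    age     : how many training iterations old this episode is
    """
    # Hypothetical schedule: drop one step of real reward per 10k iterations of age.
    n = max(1, max_steps - age // 10_000)
    n = min(n, len(rewards))

    # Sum n real rewards, then truncate the rest of the return with V(s_{t+n}).
    ret = sum(gamma ** k * rewards[k] for k in range(n))
    ret += gamma ** n * values[n]
    return ret
```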