Really interesting and impressive work, Joseph.
Here are a few possibly dumb questions which spring to mind:
What is the distribution of RTG in the dataset taken from the PPO agent? Presumably it is quite biased towards positive reward? If so, does that help explain why the state embedding has a left/right preference?
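To make the question concrete, here's a rough sketch of what I mean by "distribution of RTG" — the per-timestep returns-to-go over the sampled trajectories (the trajectories here are made up, and I'm guessing at how the rewards are stored):

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """RTG at every timestep of one trajectory."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# `ppo_trajectories` is a hypothetical list of per-episode reward arrays.
ppo_trajectories = [np.array([0.0, 0.0, 1.0]), np.array([0.0, -1.0])]
all_rtg = np.concatenate([returns_to_go(r) for r in ppo_trajectories])
print(np.histogram(all_rtg, bins=5))  # how skewed towards positive RTG is this?
```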
Is a good approximation of the RTG=-1 model just the RTG=1 model with a linear left bias?
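What I have in mind is a check along these lines — `dt_logits` is just a hypothetical wrapper around the trained model, not your actual API, and the toy version at the bottom is only there so the snippet runs:

```python
import numpy as np

def logit_shift(states, dt_logits):
    """Per-state difference between RTG=-1 and RTG=1 action logits."""
    diffs = np.stack([dt_logits(s, rtg=-1.0) - dt_logits(s, rtg=1.0) for s in states])
    # If the RTG=-1 model were just the RTG=1 model plus a fixed bias
    # (e.g. towards "left"), every row of `diffs` would be nearly identical,
    # so the per-action std across states should be small.
    return diffs.mean(axis=0), diffs.std(axis=0)

# Toy stand-in so this runs; in practice dt_logits would query the trained DT.
toy_dt_logits = lambda s, rtg: np.array([1.0 + (0.5 if rtg < 0 else 0.0), 0.3, -0.2]) + s
states = [np.zeros(3), np.ones(3) * 0.1]
print(logit_shift(states, toy_dt_logits))
```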
Does the state tokenizer let the DT see that similar positions are close to each other in state space even after you flatten? If not, might this introduce some weird effects?
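For concreteness, this is the kind of thing I'm worried about (the grid width here is made up): two cells that are adjacent in 2D end up a full row apart as flat indices, so nothing in the flattened representation itself says they are neighbours unless the embedding learns it.

```python
# Vertically adjacent grid cells become distant flat indices after row-major flattening.
width = 7                      # hypothetical grid width
flat = lambda rc: rc[0] * width + rc[1]
cell = (3, 2)                  # (row, col)
neighbour = (4, 2)             # one step down in the grid
print(flat(cell), flat(neighbour))  # 23 and 30: 7 apart despite being adjacent
```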