Really interesting and impressive work, Joseph.
Here are a few possibly dumb questions which spring to mind:
What is the distribution of RTG in the dataset taken from the PPO agent? Presumably it is quite biased towards positive reward? Does this help to explain the state embedding having a left/right preference?
Is a good approximation of the RTG=-1 model just the RTG=1 model with a linear left bias?
Does the state tokenizer allow the DT to see that similar positions are close to each other in state space even after you flatten? If not, might this be introducing some weird effects?
My apologies! I thought I had responded to this. Better late than never. All reasonable questions.
“What is the distribution of RTG in the dataset taken from the PPO agent? Presumably it is quite biased towards positive reward? Does this help to explain the state embedding having a left/right preference?”
The RTG distribution taken from the PPO agent is shown in Figure 4. It is essentially the marginal distribution of reward: since the reward only occurs at the end of the episode, the RTG and reward distributions are more or less identical.
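To spell out why those two distributions coincide, here is a minimal sketch of the return-to-go computation (the reward value 0.96 and the undiscounted setup are assumptions for illustration, not numbers from the dataset):

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """RTG_t = sum_{k >= t} gamma**(k - t) * r_k for a single episode."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Hypothetical sparse-reward episode: zero reward until the final step.
rewards = [0.0, 0.0, 0.0, 0.0, 0.96]
print(returns_to_go(rewards))
# -> [0.96 0.96 0.96 0.96 0.96]
# With reward only at the terminal step (and gamma = 1), the RTG at every
# timestep equals the episode's final reward, so the per-token RTG
# distribution mirrors the per-episode reward distribution
# (up to weighting by episode length).
```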
If by “biased towards positive reward” you mean the tendency for the agent conditioned on a low RTG (0.3) to achieve an RTG closer to 0.8, then yes: I think this is a function of the training distribution, which contains far more trajectories with RTG ~ 0.8 than with RTG ~ 0.3.
The left/right preference in the state embedding is probably a consequence of the PPO agent first finding success going clockwise, that behaviour being reinforced, and clockwise trajectories then dominating the training data for the DT.
“Is a good approximation of the RTG=-1 model just the RTG=1 model with a linear left bias?”
I’m not sure what you mean by “linear left bias”. The RTG = -1 model behaves in a characteristically different way (see Table 1).
“Does the state tokenizer allow the DT to see that similar positions are close to each other in state space even after you flatten? If not, might this be introducing some weird effects?”
Yes, I believe the encoding used had that property. I’ve since replicated these results with a less nasty encoding, and the results are mostly similar, if a little easier to interpret.
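For concreteness, here is a toy sketch of the property the question is probing (this is not the actual tokenizer; the grid size and row-major flattening are assumptions): whether two physically adjacent positions stay close in state space depends entirely on how the position is encoded after flattening.

```python
import numpy as np

# Toy illustration: a 7x7 grid and row-major flattening are assumed.
grid_width = 7
pos_a, pos_b = (3, 2), (3, 3)  # two vertically adjacent cells

# Flattening to a single scalar index loses adjacency information:
idx_a = pos_a[1] * grid_width + pos_a[0]
idx_b = pos_b[1] * grid_width + pos_b[0]
print(abs(idx_a - idx_b))  # 7: neighbours end up grid_width apart

# Keeping (x, y) coordinates (or any locality-preserving embedding)
# keeps neighbouring positions close in state space:
print(np.linalg.norm(np.array(pos_a) - np.array(pos_b)))  # 1.0
```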