Oh nice, I was interested in doing mechanistic interpretability on decision transformers myself and had gotten started during SERI MATS, but then I became more interested in looking into algorithm distillation and the decision transformers work fell by the wayside (plus I haven't been very productive during the last few weeks, unfortunately).
It's too late to read the post in detail today, but I'll probably read it properly and look at the repo tomorrow.
I'm interested in helping with this, and I'm likely going to be working on some related research in the near future anyway.
Also, by the way, I think that once we get to the point where we understand what's going on in the setup from the original DT paper, it would be interesting to look into this:
https://arxiv.org/abs/2201.12122
Also, the DT paper finds that their model generalizes to bigger RTGs than those in the training set in the Seaquest environment, and it would be interesting to get a mechanistic explanation of why that happens (though that's an Atari task, and I think you're right that it will probably have to come later because CNNs are probably harder to work with).
Another thing to note is that OpenAI's VPT, while technically not a decision transformer (because it doesn't predict rewards, if I remember correctly), is a similar kind of thing in that it is offline RL as sequence prediction, and it is probably one of the biggest publicly available pretrained models of this kind.
There are also multiple open-source implementations of Gato that could be interesting to try to do interpretability on. https://github.com/Shanghai-Digital-Brain-Laboratory/BDM-DB1
Also, training decision transformers on MineRL (or on EleutherAI's future Minetest environment) seems like what might come next after Atari (the tasks Gato is trained on are mostly Atari games and Google stuff that is not publicly available, if I remember correctly).
(Sorry if this is too rambly; I'm half asleep and got excited because I think work on DTs is a potentially very promising area for alignment. I was procrastinating on writing a post trying to convince more people to work on it, and I'm pleasantly surprised other people had the same idea.)
Hi Victor,
Glad you're keen on this area; I'd be very happy to collaborate. I'll respond to your comments here, but I'm happy to talk more after you've read the post.
The linked paper (Can Wikipedia Help Offline Reinforcement Learning?) is very interesting in a few ways; however, I'd be interested in targeted reasons to investigate this specifically. I think working with larger models is often justified, but it might make sense to squeeze more juice out of the small models before moving to larger ones. Happy to hear the arguments, though.
Also, the DT paper finds that their model generalizes to bigger RTGs than those in the training set in the Seaquest environment, and it would be interesting to get a mechanistic explanation of why that happens (though that's an Atari task, and I think you're right that it will probably have to come later because CNNs are probably harder to work with).
I'm glad you asked about this. In terms of extrapolation, an earlier model I trained seemed to behave like this. Analysis techniques like the RTG Scan functionality in the app provide ways to explore the mechanisms behind this, which I decided not to explore in this post (and possibly in general) for a few reasons:
It's not clear to me that this is more than a coincidence. I think it could be that, in the space of functions that map RTG to behaviour, it is possible for certain games to learn coincidentally extrapolating functions. If the model were to develop qualitatively different behaviour (under some definition) in out-of-distribution RTG ranges for any task, then my interest would be renewed.
I suspect doing something like integrated gradients for CNN layers is pretty doable (maybe that's what the MATS shard team have done; see one of Alex Turner's comments), but yeah, they are probably harder to work with.
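For concreteness, here is a rough sketch of the kind of RTG scan I have in mind; the model interface and names below are illustrative assumptions, not the app's actual API:

```python
# Hypothetical sketch: sweep RTG across (and beyond) the training range for a fixed
# observation and record the action distribution, to check whether behaviour changes
# qualitatively out of distribution. `model` and `obs` are placeholders.
import torch

def rtg_scan(model, obs, rtg_values):
    """Return the action distribution at each conditioning RTG value."""
    action_probs = []
    for rtg in rtg_values:
        with torch.no_grad():
            # Assumed forward signature: model(obs, rtg) -> action logits
            logits = model(obs, torch.tensor([[rtg]]))
            action_probs.append(torch.softmax(logits[0, -1], dim=-1))
    return torch.stack(action_probs)  # shape: (num_rtg_values, num_actions)

# Example usage (if training RTGs lie in [0, 1], scan past them to probe extrapolation):
# probs = rtg_scan(model, obs, torch.linspace(-0.5, 1.5, 21))
```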
Another thing to note is that OpenAI's VPT, while technically not a decision transformer (because it doesn't predict rewards, if I remember correctly), is a similar kind of thing in that it is offline RL as sequence prediction, and it is probably one of the biggest publicly available pretrained models of this kind. There are also multiple open-source implementations of Gato that could be interesting to try to do interpretability on. https://github.com/Shanghai-Digital-Brain-Laboratory/BDM-DB1
Thank you for sharing this! I'd be very excited to see attempts to understand these models. I've started with toy models for reasons like simplicity and complete control, but I can see many arguments in favour of jumping to larger models. The main challenge I see would be loading the weights into a TransformerLens model so that we can get the activation cache, which enables easy analysis. This is likely quite doable.
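As a rough illustration of what I mean (the config values and the weight-renaming step are assumptions for illustration, not something I've done for these models):

```python
# Hypothetical sketch: build a HookedTransformer matching the checkpoint's architecture,
# copy the (renamed) weights in, then use run_with_cache to access intermediate activations.
import torch
from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=6,      # placeholder values; read these off the real checkpoint
    d_model=512,
    n_ctx=1024,
    d_head=64,
    n_heads=8,
    d_vocab=1,       # DT-style models embed states/actions/RTGs rather than a token vocab
    act_fn="gelu",
)
model = HookedTransformer(cfg)

# state_dict_converted would be the original weights renamed to TransformerLens
# conventions (blocks.0.attn.W_Q, blocks.0.mlp.W_in, ...).
# model.load_state_dict(state_dict_converted, strict=False)

dummy_tokens = torch.zeros((1, 10), dtype=torch.long)  # stand-in input for illustration
logits, cache = model.run_with_cache(dummy_tokens)
print(cache["blocks.0.attn.hook_pattern"].shape)  # e.g. layer-0 attention patterns
```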
Also, training decision transformers on MineRL (or on EleutherAI's future Minetest environment) seems like what might come next after Atari (the tasks Gato is trained on are mostly Atari games and Google stuff that is not publicly available, if I remember correctly).
Interesting!