It doesn’t sound hard at all. The things Gwern is describing are the same sort of thing people already do for interpretability, where they, e.g., find an image that maximizes the probability the network assigns to a target class.
Of course, you need access to the model’s weights, so only OpenAI could do it for GPT-3 right now.
Doing it with GPT-3 would be quite challenging just for the compute requirements (RAM alone). You’d definitely want to test this out on GPT-2-117M first. If the approach works at all, it should work for the smallest models too.
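To make the interpretability analogy concrete, here is a toy sketch of the core technique: gradient ascent on the *input* (rather than the weights) to maximize the probability a fixed classifier assigns to a target class. Everything here is a hypothetical stand-in; a real attempt would backpropagate through GPT-2’s input embeddings instead of a random linear model.

```python
import numpy as np

# Fixed random linear + softmax "network": 10-dim input -> 5 classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 5))
b = rng.normal(size=5)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def prob_of_class(x, target):
    """Probability the model assigns to `target` given input x."""
    return softmax(W.T @ x + b)[target]

target = 2
x = rng.normal(size=10)  # start from a random input
lr = 0.5
for _ in range(200):
    p = softmax(W.T @ x + b)
    # Gradient of log p[target] w.r.t. x is W[:, target] - W @ p;
    # ascend it to make the input "look like" the target class.
    x += lr * (W[:, target] - W @ p)

print(prob_of_class(x, target))  # close to 1 after optimization
```

The same loop, with the linear model swapped for a language model and `x` swapped for a sequence of (soft) prompt embeddings, is essentially what Gwern’s proposal requires; the hard part is scale, not the algorithm.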