I think that particularly the first of these two results is pretty mind-blowing, in that it demonstrates an extremely simple and straightforward procedure for directly modifying the learned knowledge of transformer-based language models. That being said, it’s the second result that probably has the most concrete safety applications—if it can actually be scaled up to remove all the relevant knowledge—since something like that could eventually be used to ensure that a microscope AI isn’t modeling humans or ensure that an agent is myopic in the sense that it isn’t modeling the future.
Despite agreeing that the results are impressive, I’m less optimistic than you are about this path to microscope AI and/or myopia. Getting there would require an exhaustive listing of everything we don’t want the model to know (like human modeling or human manipulation) and a way of deleting that knowledge without breaking the whole network. The first requirement seems like a deal-breaker to me, and I’m not convinced this work actually provides much evidence that more advanced knowledge can be removed that way.
Furthermore, the specific procedure used suggests that transformer-based language models might be a lot less inscrutable than previously thought: if we can really just think about the feed-forward layers as encoding simple key-value knowledge pairs literally in the language of the original embedding layer (as I think is also independently suggested by “interpreting GPT: the logit lens”), that provides an extremely useful and structured picture of how transformer-based language models work internally.
Here too, I agree with the sentiment, but I’m not convinced that this is the whole story. This looks like how structured facts are stored, but I see no way, as of now, to get the full range of what GPT-3 and other LMs can do out of key-value knowledge pairs alone.
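To make the key-value picture concrete, here is a minimal toy sketch of the view being discussed: each row of a feed-forward layer’s input matrix acts as a “key” pattern matched against the hidden state, and the corresponding column of the output matrix is the “value” written back into the residual stream. All dimensions and weights here are illustrative stand-ins, not taken from any real model.

```python
import numpy as np

# Toy dimensions, purely illustrative.
d_model, d_ff = 8, 32
rng = np.random.default_rng(0)

W_in = rng.standard_normal((d_ff, d_model))   # "keys": one row per memory slot
W_out = rng.standard_normal((d_model, d_ff))  # "values": one column per slot

def ffn(x):
    # Key matching: how strongly does the hidden state x activate each slot?
    activations = np.maximum(W_in @ x, 0.0)   # ReLU gating
    # Value retrieval: a weighted sum of the value vectors.
    return W_out @ activations

x = rng.standard_normal(d_model)
out = ffn(x)
assert out.shape == (d_model,)
```

Under this reading, editing a fact amounts to locating the slot whose key fires on the relevant prompt and rewriting its value column, which is what makes the picture feel so structured; the open question above is whether everything an LM does fits this lookup-table shape.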