Thomas Kwa comments on D0TheMath’s Shortform

Thomas Kwa 14 Sep 2023 1:34 UTC
4 points
Not in isolation, but that’s just because characterizing the ultimate goal / optimization target of a system is way too difficult for the field right now. I think the important question is whether interp brings us closer such that in conjunction with more theory and/or the ability to iterate, we can get some alignment and/or corrigibility properties.
I haven’t read the paper and I’m not claiming that this will be counterfactual to some huge breakthrough, but understanding in-context learning algorithms definitely seems like a piece of the puzzle. To give a fanciful story from my skim, the paper says that the model constructs an internal training set. Say we have a technique to excise power-seeking behavior from models by removing the influence of certain training examples. If the model’s mesa-optimization algorithms operate differently, our technique might not work until we understand this and adapt the technique. Or we can edit the internal training set directly rather than trying to indirectly influence it.