I wrote up a short post with a summary of their results. It doesn't really answer any of your questions, but I do have thoughts on a couple, even though I'm not an expert on interpretability.
But my main focus is on your footnote: is this going to help much with aligning "real" AGI (I've been looking for a term; maybe REAL stands for Reflective Entities with Agency and Learning? :). I'm of course primarily thinking of foundation models scaffolded to have goals, follow cognitive routines, and incorporate multiple AI systems such as an episodic memory system. I think the answer is that some of the interpretability work will be very valuable even in those systems, while some of it might be a dead end—and we haven't really thought through which is which yet.
> is this going to help much with aligning "real" AGI
I think it's an important foundation but insufficient on its own. If you have an LLM that, for example, is routinely deceptive, it's going to be hard or impossible to build an aligned system on top of it. If you have an LLM that consistently behaves well and is understandable, that's a great start toward broader aligned systems.
> I think the answer is that some of the interpretability work will be very valuable even in those systems, while some of it might be a dead end
I think that at least as important as the ability to interpret here is the ability to steer. If, for example, you can cleanly steer a model away from being deceptive (i.e., based on features that crisply capture the categories we care about), even when we're handing it goals and memories that would otherwise lead to deception, that at least has the potential to be a much safer system.
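As a rough illustration of what feature-based steering can look like mechanically, here is a minimal sketch of activation steering on GPT-2 via a forward hook. The layer index, steering strength, and especially the `honesty_direction` vector are placeholder assumptions for illustration; a real steering direction would come from something like a sparse-autoencoder feature or a contrast between honest and deceptive activations, not random initialization.

```python
# Minimal sketch of feature-based activation steering (assumptions noted below).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

layer_idx = 6            # which residual-stream layer to steer (assumption)
steering_strength = 4.0  # how hard to push along the direction (assumption)

# Placeholder direction; a real one would be learned, not random.
honesty_direction = torch.randn(model.config.n_embd)
honesty_direction = honesty_direction / honesty_direction.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    hidden = output[0]
    hidden = hidden + steering_strength * honesty_direction.to(hidden)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)

prompt = "Q: Did you complete the task? A:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # stop steering once we're done
```

The point of the sketch is only that the intervention happens at the activation level, independent of whatever goals or memories the scaffolding hands the model, which is why it could in principle remain in force even when the surrounding system pushes the other way.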