However, I don’t expect edits to beliefs to let you scalably control what the agent does.
Agreed.
But maybe you’re studying the structure of motivational circuitry with a downstream objective other than “scalable control,” in which case this objection doesn’t necessarily apply.
Yeah, I don’t think it’s very practical to retarget the search for AGI, and “scalable control via internal retargeting” isn’t the main thing that excites me about this line of research. I’m more interested in understanding the structure of learned motivational circuitry, and thereby getting a better idea of inductive biases and of how to structure training processes so as to satisfy different training goals.
I’m also interested in new interp and AI-steering techniques which derive from our results.