I agree that motivation should reduce to low-level, primitive things, and also that changing the agent’s belief about where the cheese is lets you retarget behavior. However, I don’t expect edits to beliefs to let you scalably control what the agent does: if the agent is smart enough and making sufficiently complicated plans, you won’t have a reliable mapping from (world-model state) to (abstract class of behavior executed by the agent). By “abstract class of behavior” I mean things like “put the red balls in the blue basket” or “pet all the cats in the environment.”
It also seems plausible to me that there exist parts of the agent that do allow for scalable control through modification, and this is what I would refer to as “the values” (the classic example here is a utility function, though things like RL agents might not have those).
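To make that distinction concrete, here is a minimal toy sketch (purely illustrative; the class, names, and gridworld are made up for this comment, not taken from the actual maze experiments): a model-based planner that picks whichever action its utility function rates best under its world model’s predictions. A belief edit only moves the target within the fixed goal template “go to the cheese,” whereas a value edit swaps in a different abstract goal outright.

```python
class ToyAgent:
    def __init__(self):
        # "Beliefs": world-model state, e.g. where the agent thinks the cheese is.
        self.believed_cheese_pos = (3, 3)
        # "Values": a utility function over predicted world states.
        self.utility = lambda state: -state["dist_to_cheese"]

    def predict(self, pos, action):
        """World model: predict the state that results from taking `action`."""
        moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
        dx, dy = moves[action]
        new_pos = (pos[0] + dx, pos[1] + dy)
        cx, cy = self.believed_cheese_pos
        return {"pos": new_pos,
                "dist_to_cheese": abs(new_pos[0] - cx) + abs(new_pos[1] - cy)}

    def act(self, pos):
        """Choose the action whose predicted outcome the utility function rates best."""
        return max(["up", "down", "left", "right"],
                   key=lambda a: self.utility(self.predict(pos, a)))


agent = ToyAgent()
print(agent.act((0, 0)))   # heads toward the believed cheese location

# Belief edit: retargets *where* the agent goes, but only inside the fixed
# "go to the cheese" goal.
agent.believed_cheese_pos = (-5, 0)
print(agent.act((0, 0)))   # now heads left instead

# Value edit: swaps in a different goal entirely (here, "get to the basket at
# (0, -4)"), the kind of abstract retargeting a belief edit can't express.
agent.utility = lambda state: -(abs(state["pos"][0]) + abs(state["pos"][1] + 4))
print(agent.act((0, 0)))   # heads down toward the "basket"
```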
But maybe you’re studying the structure of motivational circuitry with a downstream objective other than “scalable control,” in which case this objection doesn’t necessarily apply.
However, I don’t expect edits to beliefs to let you scalably control what the agent does
Agreed.
But maybe you’re studying the structure of motivational circuitry with a downstream objective other than “scalable control,” in which case this objection doesn’t necessarily apply.
Yeah, I don’t think it’s very practical to retarget the search for AGI, and “scalable control via internal retargeting” isn’t the main thing that excited me about this line of research. I’m more interested in understanding the structure of learned motivational circuitry, and thereby getting a better sense of the relevant inductive biases and of how to structure training processes so as to satisfy different training goals.
I’m also interested in new interp and AI-steering techniques which derive from our results.