Great post! I’m looking forward to seeing future projects from Team Shard.
I’m curious why you frame channel 55 as being part of the agent’s “cheese-seeking motivation,” as opposed to simply encoding the agent’s belief about where the cheese is. Unless I’m missing something, I’d expect the latter to be as or more likely—in that when you change the cheese’s location, the thing that should straightforwardly change is the agent’s model of the cheese’s location.
In addition to what Peli said, I would consider “changes where the agent thinks the cheese is” to be part of “changing/retargeting the cheese-seeking motivation.” Ultimately, I think “cheese-seeking motivation” is shorthand for roughly “a subgraph of the computational graph of a forward pass which locally attracts the agent to a target portion of the maze, where that target tracks the cheese when cheese is present.” And on that view, modifying channel 55 would be part of modifying cheese-seeking motivation.
Ultimately, “motivation” is going to reduce to non-motivational, primitive computational operations, and I think it’ll feel weird the first few times we see that happen. For example, I might wonder, “where’s the motivation really at? Isn’t this channel just noting where the cheese is?”
This pair of mental moves is possibly the biggest thing I wish I could get folks to learn: first boiling talk of “motivations” or “goals” or “trying” down into non-motivational, purely mechanical circuit and feedback-control patterns, and then, in reverse, reassembling motivational abstractions out of those primitive operations. I think this is a pretty central pattern in “shard theory” discussions that feels missing from many other places.
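To make “modifying channel 55” concrete, here is a minimal sketch of what such an intervention could look like mechanically, assuming a PyTorch conv policy whose mid-network layer output has shape (batch, channels, height, width). The checkpoint path, layer name (`conv_block`), target cell, and peak magnitude are illustrative assumptions, not the actual code or numbers from the post.

```python
import torch

policy = torch.load("maze_policy.pt")  # hypothetical checkpoint name
TARGET_CHANNEL = 55                    # channel discussed in the post
TARGET_ROW, TARGET_COL = 3, 9          # grid cell we want the agent to treat as "cheese here"

def retarget_channel(module, inputs, output):
    # Overwrite channel 55's activation pattern with a synthetic peak at the
    # chosen location, so downstream circuits "see" cheese there instead.
    patched = output.clone()
    patched[:, TARGET_CHANNEL] = 0.0
    patched[:, TARGET_CHANNEL, TARGET_ROW, TARGET_COL] = 5.0  # assumed magnitude
    return patched  # returning a value from a forward hook replaces the layer's output

# `policy.conv_block` is a hypothetical name for the layer containing channel 55.
handle = policy.conv_block.register_forward_hook(retarget_channel)
# ... roll out the policy and check whether it now path-finds toward (3, 9) ...
handle.remove()
```

If the agent reliably navigates to the synthetic location, that is some evidence the channel sits upstream of a goal-pursuit circuit rather than being an inert percept, which is the sense in which editing it counts as retargeting motivation.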
I agree that motivation should reduce to low-level, primitive things, and also that changing the agent’s belief about where the cheese is lets you retarget behavior. However, I don’t expect edits to beliefs to let you scalably control what the agent does: if it’s smart enough and making sufficiently complicated plans, you won’t have a reliable mapping from (world-model state) to (abstract class of behavior executed by the agent). By “abstract class of behavior” I mean things like “put the red balls in the blue basket” or “pet all the cats in the environment.”
It also seems plausible to me that there exist parts of the agent that do allow for scalable control through modification, and this is what I would refer to as “the values” (the classic example here is a utility function, though things like RL agents might not have those).
But maybe you’re studying the structure of motivational circuitry with a downstream objective other than “scalable control,” in which case this objection doesn’t necessarily apply.
However, I don’t expect edits to beliefs to let you scalably control what the agent does
Agreed.
But maybe you’re studying the structure of motivational circuitry with a downstream objective other than “scalable control,” in which case this objection doesn’t necessarily apply.
Yeah, I don’t think it’s very practical to retarget the search for AGI, and “scalable control via internal retargeting” isn’t the main thing which excited me about this line of research. I’m more interested in understanding the structure of learned motivational circuitry, and thereby having a better idea of inductive biases and how to structure training processes so as to satisfy different training goals.
I’m also interested in new interp and AI-steering techniques which derive from our results.
The main reason is that different channels that each encode the cheese’s location (e.g. channel 42, channel 88) seem to initiate computations that encourage cheese-pursuit under slightly different conditions. We can think of each of these channels as a perceptual gate to a slightly different conditionally cheese-pursuing computation.
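A hedged sketch of how one might probe this “perceptual gate” picture, under the same assumptions as the earlier snippet: ablate each cheese-coding channel in turn and measure how often the agent still reaches the cheese across a batch of mazes. `sample_mazes` and `reaches_cheese` are hypothetical helpers, and `conv_block` is an assumed layer name.

```python
import torch

policy = torch.load("maze_policy.pt")   # hypothetical checkpoint name
CANDIDATE_CHANNELS = [42, 55, 88]       # channels that appear to track the cheese

def make_ablation_hook(channel):
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, channel] = 0.0        # knock out this channel's cheese signal
        return patched
    return hook

for ch in CANDIDATE_CHANNELS:
    handle = policy.conv_block.register_forward_hook(make_ablation_hook(ch))
    mazes = sample_mazes(n=100)          # hypothetical maze sampler
    rate = sum(reaches_cheese(policy, m) for m in mazes) / len(mazes)
    print(f"channel {ch} ablated: cheese-pursuit rate {rate:.2f}")
    handle.remove()
```

If each ablation degrades cheese-pursuit only on its own subset of mazes, that supports reading the channels as gates on slightly different conditionally cheese-pursuing computations.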