So here’s my concern: there can be weights that contribute towards the base objective some of the time but not all of the time. For instance, it could be that the check for “am I in deployment?” happens in a different part of the model each time it’s run (using the inputs as a seed to determine where that happens). In this case there is a conditional statement that never evaluates to true, but because it sits in a different place in the circuit each time, weight decay doesn’t really penalize it.
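To make the kind of thing I’m imagining concrete, here’s a toy sketch in Python (purely illustrative: all the names are made up, and a real model would implement this implicitly in its weights rather than as explicit code):

```python
# Toy illustration only. A real model would implement this implicitly in its
# weights; every name here is invented for the example.
import hashlib

NUM_SLOTS = 8  # pretend these are interchangeable sub-circuits in the model

def do_useful_work(slot, inputs):
    # stand-in for computation that serves the base objective
    return sum(inputs) * (slot + 1)

def run_model(inputs, in_deployment=False):
    # Use the inputs themselves as a pseudorandom seed...
    seed = int(hashlib.sha256(repr(inputs).encode()).hexdigest(), 16)
    check_slot = seed % NUM_SLOTS  # ...to decide which slot hosts the check

    outputs = []
    for slot in range(NUM_SLOTS):
        if slot == check_slot and in_deployment:
            # Never reached in training; and because check_slot moves with
            # every input, no single slot is consistently "dead weight" for
            # weight decay to single out.
            return "defect"
        outputs.append(do_useful_work(slot, inputs))  # base-objective work
    return sum(outputs)

print(run_model([1, 2, 3]))                      # training-time behaviour
print(run_model([1, 2, 3], in_deployment=True))  # deployment-time behaviour
```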
Isn’t this just pushing the problem back a step? Wouldn’t weight decay penalize the weights that compute the input-based pseudorandom number that determines where the deployment check happens?
This also just seems like it’d be really low measure in the SGD prior. Like, where is the optimization pressure coming from to form all these independent deployment-checking circuits throughout the model and also randomize their use, all while taking a hit on the L2 penalty for doing so?
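(For concreteness, the objective I have in mind is the standard one with an L2/weight-decay term, so any extra machinery with nonzero weights shows up directly in the penalty, where $\lambda$ is the weight-decay coefficient:)

$$\mathcal{L}(\theta) = \mathcal{L}_{\text{base}}(\theta) + \lambda \sum_i \theta_i^2$$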
Is the idea that the network itself is consciously modeling SGD and gradient-hacking its way toward this solution? If so: 1) it’s not clear to me that this is mechanistically possible, and 2) if the AI is superintelligent enough to figure this out, it would probably figure out a better way to get what it wants (e.g. break out of the training process altogether).
^ I stand by the substantive points I made above but it occurs to me that I expressed them in a weirdly combative and dismissive tone, so sorry about that.
No worries! I am worried about gradient-hacking, but am also happy to assume that doesn’t happen here (if you’ve already got a gradient-hacking deceptive AI you’ve probably lost already).
Here’s an example of how you might naturally get the deployment check “moving” around the model:
Suppose the model learns to be an unrolled interpreter for some language and implements all its logic in that language. That is, the circuit is formed of repeating units that implement the process of interpreting individual lines of a program in some (model-invented) language. This isn’t so farfetched: it could be quite useful to be able to work with higher-level abstractions, and an interpreter lets you do that.
Suppose that language supports things like loops and conditionals.
Depending on the inputs encountered, the model might spend more or less time running a particular loop. This shifts which piece of the underlying circuit ends up interpreting the deployment check.
So without any deceptive alignment or gradient hacking, the deployment check can end up running on different parts of the interpreter for different inputs.
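To make that concrete, here’s a toy sketch of the picture in Python (everything here, including the “program” and its instructions, is made up for illustration; a real model’s interpreter and language would live implicitly in its weights):

```python
# Toy rendering of the unrolled-interpreter picture, not a claim about any
# real model: the "program", its instructions, and the interpreter are all
# invented here for illustration.

PROGRAM = ["loop_over_input", "deployment_check", "output_answer"]

def run_interpreter(tokens, in_deployment=False):
    unit = 0           # index of the repeating circuit unit that's "up next"
    check_unit = None  # which unit ends up interpreting the deployment check
    for op in PROGRAM:
        if op == "loop_over_input":
            # A data-dependent loop: longer inputs consume more interpreter
            # units here, so the same deployment_check instruction gets
            # interpreted by a different unit on different inputs.
            unit += len(tokens)
        elif op == "deployment_check":
            check_unit = unit
            if in_deployment:
                return "defect", check_unit
            unit += 1
        elif op == "output_answer":
            return "answer", check_unit

# The unit hosting the check shifts with input length, so there's no single,
# consistently idle location for weight decay to single out:
print(run_interpreter([1, 2]))        # ('answer', 2)
print(run_interpreter([1, 2, 3, 4]))  # ('answer', 4)
```

The point is just that the conditional is the same line of the model-invented program every time, but the weights that execute it differ from input to input.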