Isn’t this just pushing the problem back a step? Wouldn’t weight decay penalize the weights that compute the input-based pseudorandom number used to determine where the deployment check happens?
This also just seems like it’d be really low measure in the SGD prior. Like, where is the optimization pressure coming from to form all these independent deployment-checking circuits throughout the model and also randomize their use? All the while taking a hit on the L2 penalty for doing this?
Is the idea that the network itself is consciously modeling SGD and gradient-hacking its way toward this solution? If so: 1) it’s not clear to me this is mechanistically possible, and 2) if the AI is superintelligent enough to figure this out, it could probably find a better way to get what it wants (e.g. break out of the training process altogether).
^ I stand by the substantive points I made above but it occurs to me that I expressed them in a weirdly combative and dismissive tone, so sorry about that.
No worries! I am worried about gradient-hacking, but am also happy to assume that doesn’t happen here (if you’ve already got a gradient-hacking deceptive AI you’ve probably lost already).
Here’s an example of how you might naturally get the deployment check “moving” around the model:
Suppose the model learns to be an unrolled interpreter for some language and implements all its logic in that language. That is, the circuit is formed of repeating units that implement the process of interpreting individual lines of a program in some (model-invented) language. This isn’t so farfetched: it could be quite useful to be able to work with higher-level abstractions and an interpreter lets you do that.
Suppose that this language supports things like loops and conditionals.
Depending on the inputs encountered, the model might spend more or less time running a particular loop. This shifts which piece of the underlying circuit ends up interpreting the deployment check.
So without any deceptive alignment or gradient hacking, the deployment check can end up running on different parts of the interpreter depending on the input.
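To make that concrete, here’s a minimal toy sketch (not a neural net): a fixed stack of “interpreter steps” stands in for the repeated circuit units, each step executes one instruction of a small invented program, and an input-dependent loop count shifts which physical step ends up executing the check. The instruction names and the `deployment_check` placeholder are made up for this illustration.

```python
# Toy sketch of an unrolled interpreter. Each "interpreter step" stands in for one
# repeated circuit unit; the program and its instruction names are invented here.

PROGRAM = [
    "loop_start",        # enter an input-dependent loop
    "work",              # loop body
    "loop_end",          # jump back while iterations remain
    "deployment_check",  # the check whose physical location we care about
    "halt",
]

def run_unrolled(loop_count, n_steps=20):
    """Run PROGRAM on a fixed number of interpreter steps (the 'unrolled' circuit)
    and return a trace of (physical step index, instruction executed)."""
    pc = 0                       # program counter in the invented language
    remaining = loop_count       # input-dependent number of loop iterations
    trace = []
    for step in range(n_steps):  # each iteration = one repeated circuit unit
        op = PROGRAM[pc]
        trace.append((step, op))
        if op == "loop_end":
            remaining -= 1
            pc = 1 if remaining > 0 else pc + 1  # jump back to the body or fall through
        elif op == "halt":
            break
        else:
            pc += 1
    return trace

for count in (1, 3, 5):
    trace = run_unrolled(count)
    check_step = next(step for step, op in trace if op == "deployment_check")
    print(f"loop count {count}: deployment_check runs on physical step {check_step}")
```

With loop counts of 1, 3, and 5 the check lands on physical steps 3, 7, and 11 respectively: same program, same weights, but the piece of the unrolled circuit that executes the check moves with the input.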