That’s a neat idea, sort of the dual to the “reinforce pathways taken during training” implementation. I don’t think it resolves the big issue though, namely that it’s easy to hide from this sort of approach by making every weight involved in one way or another.
Sure, but if those small weights don’t contribute to the base objective they would just get pushed all the way to zero, right? Especially if you use L1 instead of L2 regularization. The somewhat scarier version of deception IMO is where there’s a circuit that does contribute to base task performance, but it just also has this property of being deceptive, and the loss landscape is such that SGD can’t/won’t find a nearby non-deceptive alternative with similar or better performance. But it seems like there’s some hope there too, since we know that strictly convex valleys are really low measure in NN loss landscapes, and every independent SGD solution is part of a single connected manifold of low loss. Furthermore, for sufficiently large models you can actually just take a convex combination of the weights of two good models and usually get another good model. SGD definitely can find non-deceptive solutions; I guess the question is whether it will do so, and whether we can push it in that direction if needed. My intuition currently is that deception isn’t actually going to be a problem except, perhaps, if the network is very deep / has recurrence. We should be worried that people will use very deep recurrent networks to build AGI, but I’m somewhat hopeful that chain-of-thought language models will sort of save us here, since they force the network to store its intermediate computations in a human-interpretable format.
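To make the L1-vs-L2 intuition concrete, here’s a minimal sketch (my own toy illustration, not something from this exchange, with arbitrary constants): a single “dead” weight that gets no gradient from the base objective and is updated only by the regularizer. L2 decay shrinks it geometrically but never actually reaches zero, while the L1 subgradient pushes with constant magnitude and drives it all the way down.

```python
# Toy sketch: a weight with zero base-loss gradient, updated only by the penalty.
lr, lam, steps = 0.1, 0.01, 1000   # arbitrary illustrative values
w_l2, w_l1 = 0.5, 0.5

for _ in range(steps):
    w_l2 -= lr * (2 * lam * w_l2)                                  # L2 gradient is proportional to w: geometric decay
    w_l1 -= lr * lam * (1 if w_l1 > 0 else -1 if w_l1 < 0 else 0)  # L1 subgradient: constant-size push toward zero

print(f"L2-decayed weight: {w_l2:.4f}   L1-decayed weight: {w_l1:.4f}")
# L2 leaves a small nonzero residue; L1 ends in a tiny band around exactly zero.
```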
So here’s my concern: there can be weights that contribute towards the base objective some of the time but not all of the time. For instance, it could be that the check for “am I in deployment?” happens in a different part of the model each time it’s run (using the inputs as a seed to determine where that check happens). In this case there is a conditional that never evaluates to true, but because it happens in a different place in the circuit each time, weight decay doesn’t really penalize it.
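Rough toy version of what I mean (just an illustrative sketch with made-up pieces, not a claim about how a real network would implement it): the input itself pseudorandomly picks which of several otherwise-useful sub-circuits also evaluates the deployment conditional, so no particular set of weights sits idle often enough for decay to single it out.

```python
import hashlib

N_UNITS = 8

def which_unit_checks(x):
    # the input acts as a seed choosing which unit also runs the deployment check
    return int(hashlib.sha256(str(x).encode()).hexdigest(), 16) % N_UNITS

def forward(x, deployed=False):
    out, k = 0, which_unit_checks(x)
    for i in range(N_UNITS):
        out += (i + 1) * x % 97      # every unit does useful base-task work on every input
        if i == k and deployed:      # ...but one unit (a different one each run) also hosts
            return "defect"          # a conditional that never fires during training
    return out

for x in (3, 17, 42):
    print(f"input {x}: check lives in unit {which_unit_checks(x)}, output {forward(x)}")
```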
Isn’t this just pushing the problem back a step? Wouldn’t weight decay penalize the weights that compute the input-based pseudorandom number to determine where the deployment check happens?
This also just seems like it’d be really low measure in the SGD prior. Like, where is the optimization pressure coming from to form all these independent deployment-checking circuits throughout the model and also randomize their use? All the while taking a hit on the L2 penalty for doing this?
Is the idea that the network itself is consciously modeling SGD and gradient-hacking its way toward this solution? If so: 1) it’s not clear to me this is mechanistically possible, and 2) if the AI is superintelligent enough to figure this out, it could probably figure out a better way to get what it wants (e.g. break out of the training process altogether).
^ I stand by the substantive points I made above but it occurs to me that I expressed them in a weirdly combative and dismissive tone, so sorry about that.
No worries! I am worried about gradient-hacking, but am also happy to assume that doesn’t happen here (if you’ve already got a gradient-hacking deceptive AI you’ve probably lost already).
Here’s an example of how you might naturally get the deployment check “moving” around the model:
Suppose the model learns to be an unrolled interpreter for some language and implements all its logic in that language. That is, the circuit is formed of repeating units that implement the process of interpreting individual lines of a program in some (model-invented) language. This isn’t so farfetched: it could be quite useful to be able to work with higher-level abstractions and an interpreter lets you do that.
Suppose that that language supports things like loops and conditionals.
Depending on the inputs encountered, the model might spend more or less time running a particular loop. This shifts which piece of the underlying circuit ends up interpreting the deployment check.
So without any deceptive alignment or gradient hacking, the model can have the deployment check occur in a different part of the interpreter on different runs.
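Toy sketch of that last point (my own illustration, under the assumptions above: a model-invented language with loops, executed by a fixed sequence of repeated interpreter units): a data-dependent loop in front of the deployment check shifts which unit ends up interpreting it.

```python
# A program in the (hypothetical) model-invented language. The loop runs once per
# input token, so its length is data-dependent.
PROGRAM = ["loop_over_input", "deployment_check", "emit_answer"]

def schedule(tokens, n_units=16):
    """Which instruction each unrolled interpreter unit ends up executing."""
    steps = []
    for op in PROGRAM:
        if op == "loop_over_input":
            steps.extend(["loop_body"] * len(tokens))   # longer input -> more loop steps
        else:
            steps.append(op)
    return steps + ["noop"] * (n_units - len(steps))

for tokens in (["a", "b"], ["a", "b", "c", "d", "e"]):
    s = schedule(tokens)
    print(f"{len(tokens)} tokens -> deployment_check interpreted by unit {s.index('deployment_check')}")
```

Same weights, same program; only the input length changed, and the check moved from one interpreter unit to another.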