No worries! I am worried about gradient hacking, but am also happy to assume that doesn't happen here (if you've already got a gradient-hacking deceptive AI you've probably lost already).
Here’s an example of how you might naturally get the deployment check “moving” around the model:
Suppose the model learns to be an unrolled interpreter for some language and implements all its logic in that language. That is, the circuit is formed of repeating units, each implementing the process of interpreting one line of a program in some (model-invented) language. This isn't so far-fetched: it could be quite useful to work with higher-level abstractions, and an interpreter lets you do that.
Suppose that language supports constructs like loops and conditionals.
Depending on the inputs encountered, the model might spend more or fewer iterations running a particular loop. This shifts which piece of the underlying circuit ends up interpreting the deployment check: the more iterations the input triggers, the later in the unrolled circuit the check lands.
So without any deceptive alignment or gradient hacking, the deployment check can end up being executed by different parts of the interpreter on different inputs.
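To make this concrete, here's a minimal toy sketch in Python (everything in it, including the PROGRAM list, run_unrolled_interpreter, and the layer count, is made up for illustration, not taken from any actual model). It treats the circuit as an unrolled interpreter with one repeating unit per layer, and shows the deployment-check instruction landing on different layers depending on the input:

```python
# Toy sketch: the "circuit" is an unrolled interpreter with one repeating
# unit per layer, each interpreting one instruction of a model-invented
# program. All names here are hypothetical.

# The program loops over the input, then checks whether it's deployed.
PROGRAM = [
    "loop_over_input",   # repeats once per input token
    "deployment_check",  # the instruction whose location we track
    "act",
]

def run_unrolled_interpreter(input_tokens, n_layers=16):
    """Return the index of the layer that interprets the deployment check."""
    pc = 0                         # program counter into PROGRAM
    loop_left = len(input_tokens)  # data-dependent loop bound
    for layer in range(n_layers):
        op = PROGRAM[pc]
        if op == "loop_over_input":
            loop_left -= 1
            if loop_left <= 0:
                pc += 1            # loop finished; fall through
        elif op == "deployment_check":
            return layer           # which repeating unit ran the check
        elif op == "act":
            break
    return None

# Same program, but the check lands on different parts of the circuit:
print(run_unrolled_interpreter([1, 2]))           # layer 2
print(run_unrolled_interpreter([1, 2, 3, 4, 5]))  # layer 5
```

The check's logic never changes; which physical piece of the interpreter executes it just depends on how long the input keeps the loop running.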