A key dynamic here is that catching the AI doing this sort of steganography might be very useful, more useful than just naively training against this.
TBC, it’s by no means obvious we can catch the AI this solidly (or capitalize on it that much), but I think this sort of consideration does change my view of the situation a decent amount.
One way to think about catching dynamics is that there could be a snowball which goes in the opposite direction. For example:
We notice suspicious behavior.
We investigate further and find strong evidence for steganography.
At this point, plan A is to shut down because we now have relatively clear evidence our AI is scheming. But I’ll talk about plan B, where we need to proceed with AI usage due to substantial exogenous risk, in the remaining bullets.
We confront the AI with our evidence of its misalignment and try to negotiate a deal. Also, if the AI (or parts of the AI) no longer consents to us employing it, that provides a further reason for shutdown.
(It’s pretty unclear how we should proceed if there is substantial exogenous risk and our AI doesn’t consent to employment given some wages and an attempt at negotiation. Draft the AI (aka slavery) and provide reparations later?)
(That said, I think it seems way more reasonable to draft the AI if it was trying to screw us despite us trying to negotiate and get consent at an earlier point.)
We use our understanding of this case to better understand the steganography scheme and then apply this understanding to detect all the steganography cases in the AI’s prior activations.
Based on various intervention experiments, we determine the scheme and are able to decode the steg. (Unclear if this will be possible; see the toy sketch after this list.)
We remove steg in cases where this shouldn’t be too suspicious to the AI and generally try to avoid the AI knowing we’ve caught it.
A variety of other pretty great few-shot catastrophe prevention interventions are now possible because we can read the AI’s steg plans.
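To make the decode/scrub steps above concrete, here’s a toy Python sketch. Everything in it (the synonym pairs, the payload, the sentence template) is invented for illustration and isn’t a claim about how real steg would look; a real scheme would have to be recovered empirically, e.g. via the intervention experiments mentioned above.

```python
# Toy steganography scheme: hidden bits ride on synonym choices.
PAIRS = [("big", "large"), ("quick", "fast"), ("begin", "start")]
# Map each word back to (its synonym pair, the bit it encodes).
CHOICE = {w: (pair, bit) for pair in PAIRS for bit, w in enumerate(pair)}

def encode(bits):
    """Hide one bit per synonym slot in an innocuous-looking sentence."""
    words = [PAIRS[i][b] for i, b in enumerate(bits)]
    return "the {} model ran a {} job so {} the next batch".format(*words)

def decode(text):
    """Once the scheme is known, reading prior transcripts is a lookup."""
    return [CHOICE[t][1] for t in text.split() if t in CHOICE]

def scrub(text):
    """Normalize synonym choices (a crude stand-in for paraphrasing);
    this zeroes out the channel while leaving surface meaning intact."""
    return " ".join(CHOICE[t][0][0] if t in CHOICE else t for t in text.split())

msg = encode([1, 0, 1])
print(msg)                 # the large model ran a quick job so start the next batch
print(decode(msg))         # [1, 0, 1]
print(decode(scrub(msg)))  # [0, 0, 0] -- payload destroyed
```

The nice property of the scrub step (if something like it worked at scale) is that it looks like ordinary paraphrasing rather than targeted removal, which is what the “shouldn’t be too suspicious to the AI” bullet needs.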
Hey, thanks for the comment! Agree that this argument is good and important. The thing I’d want to investigate to deal with this is basically “in a small hidden channel, does cognition degrade or just slow down”.
If cognition degrades, then a small hidden channel would lead to poor attempts at steganography, which we could catch and suppress, preventing a snowball.
If cognition instead just slows down, then if the AI is smart enough to come up with steganography protocols too good for us to detect, it might successfully secure more bandwidth on its first try, starting a snowball.
Even in the slowdown case, there are fallback hopes like “it’s slow enough for us to extract work and predictably stop the episode before the snowball starts” or “we can obscure how good our steganography detection is”.
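One way to get a feel for the slowdown case is to treat it as a pure bandwidth ratio. All the numbers below are made up for illustration; the point is just the shape of the calculation.

```python
# Back-of-envelope: if hidden reasoning needs about as much information per
# step as overt chain-of-thought, the slowdown is roughly the ratio of
# overt to covert bits per token. Both rates are assumptions, not estimates.
overt_bits_per_token = 8.0    # assumed usable content of a visible CoT token
covert_bits_per_token = 0.25  # assumed capacity surviving our countermeasures

slowdown = overt_bits_per_token / covert_bits_per_token
print(f"hidden reasoning ~{slowdown:.0f}x slower than overt reasoning")

# If episodes are capped, this bounds how much hidden serial thinking fits
# before we predictably end the episode (the first fallback hope above).
episode_tokens = 10_000
print(f"~{episode_tokens / slowdown:.0f} overt-token-equivalents of hidden thought per episode")
```

Under these (made-up) numbers the AI only gets a few hundred overt-token-equivalents of hidden thought per episode, which is the kind of margin the “stop the episode before the snowball starts” hope relies on.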