That still leaves the issue of early training, when the AGI is not yet motivated to avoid imagining adversaries, or not yet able to. So I would say: if it does imagine the adversary, and then its goals do get hijacked, then at that point I would say “OK yes now it’s misaligned”. (Just like if a real adversary is exploiting a normal security hole—I would say the AGI is aligned before the adversary exploits that hole, and misaligned after.) Then what? Well, presumably, we will need to have a procedure that verifies alignment before we release the AGI from its training box. And that procedure would presumably be indifferent to how the AGI came to be misaligned. So I don’t think that’s really a special problem we need to think about.
This part doesn’t necessarily make sense, because prevention could be easier than after-the-fact measures. In particular,
You might be unable to defend against arbitrarily adversarial cognition, so you might want to prevent it early rather than try to detect it later, because you may be vulnerable in between.
You might be able to detect some sorts of misalignment, but not others. In particular, it might be very difficult to detect purposeful deception, since it intelligently evades whatever measures are in place. So your misalignment-detection may be dependent on averting mesa-optimizers or specific sorts of mesa-optimizers.
That’s fair. Other possible approaches are “try to ensure that imagining dangerous adversarial intelligences is aversive to the AGI-in-training ASAP, such that this motivation is installed before the AGI is able to do so”, or “interpretability that looks for the AGI imagining dangerous adversarial intelligences”.
I guess the fact that people don’t tend to get hijacked by imagined adversaries gives me some hope that the first one is feasible—like, that maybe there’s a big window where one is smart enough to understand that imagining adversarial intelligences can be bad, but not smart enough to do so with such fidelity that it actually is dangerous.
But hard to say what’s gonna work, if anything, at least at my current stage of general ignorance about the overall training process.
I think one major reason why people don’t tend to get hijacked by imagined adversaries is that you can’t simulate someone who is smarter than you, and therefore anything you can simulate in your mind is something you can defend against.
This is not a perfect argument, since I can imagine someone who has power over me in the real world, and, for example, imagine how angry they would be at me if I did something they did not like. But then their power over me comes from their power in the real world, not their ability to outsmart me inside my own mind.
Not to disagree hugely, but I have heard one religious conversion (an enlightenment-type experience) described in a way that fits with “takeover without holding power over someone”. Specifically, this person described enlightenment in terms close to “I was ready to pack my things and leave. But the poison was already in me. My self died soon after that.”
It’s possible to get the general flow of the arguments another person would make, spontaneously produce those arguments later, and be convinced by them (or at least influenced).