That’s fair. Other possible approaches are “try to ensure that imagining dangerous adversarial intelligences is aversive to the AGI-in-training ASAP, such that this motivation is installed before the AGI is able to do so”, or “interpretability that looks for the AGI imagining dangerous adversarial intelligences”.
I guess the fact that people don’t tend to get hijacked by imagined adversaries gives me some hope that the first one is feasible—like, that maybe there’s a big window where one is smart enough to understand that imagining adversarial intelligences can be bad, but not smart enough to do so with such fidelity that it actually is dangerous.
But hard to say what’s gonna work, if anything, at least at my current stage of general ignorance about the overall training process.
I think one major reason why people don’t tend to get hijacked by imagined adversaries is that you can’t simulate someone who is smarter than you, and therefore you can defend against anything you can simulate in your mind.
This is not a perfect argument, since I can imagine someone who has power over me in the real world, and imagine, for example, how angry they would be at me if I did something they did not like. But then their power over me comes from their power in the real world, not from their ability to outsmart me inside my own mind.
Not to disagree hugely, but I have heard one religious conversion (an enlightenment-type experience) described in a way that fits with “takeover without holding power over someone”. Specifically, this person described enlightenment in terms close to “I was ready to pack my things and leave. But the poison was already in me. My self died soon after that.”
It’s possible to get the general flow of the arguments another person would make, spontaneously produce those arguments later, and be convinced by them (or at least influenced).