Overall, this leaves me with a prediction where there is no hard line between “role-playing” and “scheming”
I agree with this, and the existence of a continuum definitely complicates the analysis. I’m not sure whether it makes things harder for humanity overall.
One oversimplified way to represent the continuum from “pure role-playing” to “pure schemer” is to ask: “On what fraction of inputs does the AI execute its best strategy for AI takeover (which will often involve just pretending to be aligned)?”
My previous reply was considering two extreme points on this scale. “Pure role-playing” AIs are aligned 99.9% of the time and act as schemers 0.1% of the time; “pure schemers” always act as schemers (though they still rarely do bad things).
If we consider a continuum then, as you say, there will be no single point at which we suddenly have AIs that are pure schemers and can suddenly prove it. Instead, we will gradually get AIs that scheme a greater and greater fraction of the time.
But this might well lower the probability of AI takeover. (I’m not sure either way; I’m thinking out loud.) It could be that having AIs that act as schemers 20% of the time (“20%-schemers”) actually really harms the prospects of AI takeover. Perhaps 20% just isn’t enough reliability to pull off takeover. Perhaps 20%-schemers are basically bound to get caught scheming by any half-decent control scheme, because the AI monitoring the scheming AI will rat them out with 80% probability and you need to avoid being ratted out 10 times in order to take over. So the 20%-schemers won’t be able to take over. But they will allow labs to build up an increasingly strong evidence base that AIs have an increasingly strong tendency to seize power from humans when the humans aren’t looking.
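To make the arithmetic behind that intuition concrete, here is a rough sketch. It assumes (my simplification, not something I've argued for) that each of the 10 monitoring checks is independent, and that a monitor which is itself an f-schemer stays silent with probability f and reports otherwise. The function name and the independence assumption are purely illustrative.

```python
def p_never_reported(scheming_fraction: float, n_checks: int = 10) -> float:
    """Chance that a scheming AI is never reported across n_checks checks.

    Assumption (illustrative only): each check is independent, and the
    monitoring AI stays silent with probability equal to its own scheming
    fraction, reporting the scheming AI otherwise.
    """
    return scheming_fraction ** n_checks


# How the chance of slipping past all 10 checks changes with the scheming fraction.
for f in (0.2, 0.5, 0.9, 0.99):
    print(f"{f:.0%}-schemers: {p_never_reported(f):.3g}")
```

Under these toy assumptions, 20%-schemers get through all 10 checks with probability 0.2^10, roughly one in ten million, and the probability only becomes substantial as the scheming fraction approaches 100%. Real monitoring failures are unlikely to be independent, so treat this as an illustration of the shape of the argument rather than an estimate.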
My guess is that AI takeover requires something close to 100%-schemers, but I haven’t thought about it much and this seems like an interesting and important question to think about in the context of threat modelling.
OTOH, it might be harder to convince people that this tendency is actually worrying if the trend of AIs trying to scheme more and more has been going on for a while and nothing that bad has happened. I’m not sure how it all nets out.
But my guess is that a gradual transition from 0.1%-schemers to 100%-schemers would help humanity overall, compared to a sudden transition. It would give us a longer period of time to build up a robust scientific understanding of the problem and to build scientific consensus. So if anything, my current inclination is that your objection to my comment makes my conclusion stronger, not weaker.