I will say that your arguments against human mimicry also imply that extreme actions are likely necessary to prevent misaligned AI. To put it another way, the reason the AI is simulating that world is that if we are truly doomed, with only a 1 in a million chance of alignment and AGI getting built anyway, then pivotal acts or extreme circumstances are required for the future to go well at all, and in most cases you die anyway despite the attempt. Now don’t get me wrong, this could happen, but right now that level of certainty isn’t warranted.
So if I can draw a takeaway from this post, it’s that you think our situation is so bad that nothing short of a mathematical proof of alignment or extreme pivotal acts is sufficient. Is that right? And if so, why do you think we have such low chances of making alignment work?
Subtle point: the key question is not how certain we are, but how certain the predictor system (e.g. GPT) is. Presumably if it’s able to generalize that far out of distribution at all, it’s likely to have enough understanding to make a pretty high-confidence guess as to whether AGI will take over or not. We humans might not know the answer very confidently, but an AI capable enough to apply the human mimicry strategy usefully is more likely to know the answer very confidently, whatever that answer is.
Still, that is very bad news for us (assuming the predictor is even close to right). I hope things don’t get as pessimistic as that, but if something like that 1 in a million chance were right, then we’d need to take far more extreme actions or just give up on the AI safety project, as it would effectively no longer be tractable.