Very fun modeling post. I like imagining this future as a Pokemon game, but with unfriendly AI.
You’ve got a typo near the end saying c is drawn from “Unif(1,19)” rather than “Unif(0,1)”.
I’d be interested in a heat map visualization, where we fix (for example) a and μ, and map the probability of catastrophe as a function of c and σ.
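Something along these lines is what I have in mind (a rough sketch only; `p_catastrophe` below is a made-up stand-in, not the actual model from the post, and the fixed values of a and μ are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder for the post's catastrophe probability -- NOT the real model.
# This toy version just makes risk grow with c and sigma; swap in the actual function.
def p_catastrophe(a, mu, c, sigma):
    return 1 - np.exp(-c * sigma / (a + mu))

a, mu = 1.0, 0.5                     # fixed example values
c = np.linspace(0, 1, 200)           # c ~ Unif(0, 1) per the post
sigma = np.linspace(0.01, 2, 200)
C, S = np.meshgrid(c, sigma)

plt.pcolormesh(C, S, p_catastrophe(a, mu, C, S), shading="auto", cmap="viridis")
plt.colorbar(label="P(catastrophe)")
plt.xlabel("c")
plt.ylabel("sigma")
plt.title(f"Fixed a = {a}, mu = {mu}")
plt.show()
```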
One intuitive reason p would be increasing is if we’re imagining human supervision of questionably aligned AI, so that when the AI is dumb, humans can supervise it effectively. If there’s some subset of “bad” actions that the AI might rate highly but that would, e.g., manipulate the human supervisor, those actions are probably more likely to be found as x increases (the search space might expand to include more bad actions, and the AI might be better at finding whatever bad actions are available).
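As a toy way to make that concrete (my own illustration, not anything from the post): if the AI’s search surfaces n(x) candidate actions at capability x and a fraction q of the action space is “bad”, then P(at least one bad action is found) = 1 − (1 − q)^n(x), which is increasing in x whenever n(x) is.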
This messes up some of the modeling assumptions, though, because it only makes sense in the domain where humans are still important. The model’s domain seems to be one where a bunch of superhuman, unsupervised AIs are running around at the same time, and since we haven’t immediately lost, we must already have done a really good job at alignment, but we still want to try this redundancy → safety thing.
So I guess with p you’re more looking at an “alignment tax” argument, where safety might be correlated with capability because selecting for safety is extra work?
Ah, thanks for pointing out the typo.
I’ll probably create a post soon-ish with more visualizations covering cases like the ones you suggested.
You’re right about the model being pertinent to cases where we’ve already solved the alignment problem pretty well, but want to try other safety measures. I’m particularly thinking about cases where the AIs are so advanced that humans can’t really supervise them well, so the AIs must supervise each other. In that case, I’m not sure how p would behave as a function of AI capability. Maybe it’s best to assume that p is increasing with capability, just so we’re aware of what the worst case could be?