This is a post I wrote on my personal blog after a discussion with a deep learning professor at the University of Chicago. I don’t know if this particular topic has been studied in much depth elsewhere, so I figured I would share it here. If you know of any related work (or have any other comments on this, of course), let me know.
tl;dr: Supposing someone figures out how to make an AGI and solves the alignment problem at least to some degree, would it be helpful to distribute many copies of the AI and have them keep each other in check? I propose a mathematical model for thinking about this question and conclude that this approach would probably help at first but diminish in effectiveness over time. Under some assumptions, it may actually be counterproductive.
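To give a quick flavor of the setup, here is a toy version I find useful for intuition. It is not the model from the post: it assumes each copy goes bad independently, that every copy checks every other copy, and that each check independently catches a bad copy.

```python
def catastrophe_prob(n_copies, p_bad, p_catch):
    """Toy illustration (not the post's model): each copy independently goes bad
    with probability p_bad, and each peer check independently catches a bad copy
    with probability p_catch. Catastrophe = at least one bad copy slips past all
    of its n_copies - 1 peers."""
    p_slip = p_bad * (1 - p_catch) ** (n_copies - 1)
    return 1 - (1 - p_slip) ** n_copies

for n in (1, 2, 5, 10, 50):
    print(n, round(catastrophe_prob(n, p_bad=0.05, p_catch=0.5), 6))
```

Under these independence assumptions redundancy nearly always helps; the model in the post is about when and why that stops being true.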
Very fun modeling post. I like imagining this future as a Pokemon game, but with unfriendly AI.
You’ve got a typo near the end saying c is drawn from “Unif(1,19)” rather than “Unif(0,1)”.
I’d be interested in a heat map visualization, where we fix (for example) a and μ, and map the probability of catastrophe as a function of c and σ.
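Something along these lines, say, where `catastrophe_prob` is a stand-in for however the model actually computes the probability of catastrophe from (a, μ, c, σ); the dummy body below is only there so the sketch runs:

```python
import numpy as np
import matplotlib.pyplot as plt

def catastrophe_prob(a, mu, c, sigma):
    # Dummy stand-in: replace with the model's actual probability-of-catastrophe
    # computation. This expression means nothing; it just returns a value in [0, 1].
    return 1 - np.exp(-c * sigma / (a + abs(mu) + 1e-9))

a, mu = 1.0, 0.5                     # fixed, for example
cs = np.linspace(0.0, 1.0, 100)      # c ~ Unif(0, 1), so sweep its support
sigmas = np.linspace(0.01, 2.0, 100)

grid = np.array([[catastrophe_prob(a, mu, c, s) for c in cs] for s in sigmas])

plt.pcolormesh(cs, sigmas, grid, shading="auto", vmin=0, vmax=1)
plt.colorbar(label="P(catastrophe)")
plt.xlabel("c")
plt.ylabel("σ")
plt.title(f"Fixed a = {a}, μ = {mu}")
plt.show()
```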
One intuitive reason p would be increasing is if we're imagining human supervision of questionably aligned AI, so that when the AI is dumb, humans can supervise it effectively. If there's some subset of "bad" actions that the AI might rate highly but that would, e.g., manipulate the human supervisor, those actions are probably more likely to be found as x increases (the search space might expand to include more bad actions, and the AI might be better at finding whatever bad actions are available).
This messes up some of the modeling assumptions, though, because it only makes sense in the domain where humans are still important. The model's domain seems to be one where there are a bunch of superhuman, unsupervised AIs running around at the same time, and we haven't immediately lost, so we must already have done a really good job at alignment, but we still want to try this redundancy → safety thing.
So I guess with p you're more looking at an "alignment tax" argument, where safety might be anti-correlated with capability because selecting for safety is extra work?
Ah, thanks for pointing out the typo.
I’ll probably create a post soon-ish with more visualizations covering cases like the ones you suggested.
You’re right about the model being pertinent to cases where we’ve already solved the alignment problem pretty well, but want to try other safety measures. I’m particularly thinking about cases where the AIs are so advanced that humans can’t really supervise them well, so the AIs must supervise each other. In that case, I’m not sure how p would behave as a function of AI capability. Maybe it’s best to assume that p is increasing with capability, just so we’re aware of what the worst case could be?
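E.g., even something as simple as a logistic curve would encode that worst-case assumption (illustrative only; none of these particular numbers come from the post):

```python
import numpy as np

def p_of_x(x, p0=0.01, k=1.0, x0=5.0):
    """Illustrative monotone-increasing p(x): roughly p0 for low-capability AIs,
    rising toward 1 as capability x grows past x0."""
    return p0 + (1 - p0) / (1 + np.exp(-k * (x - x0)))
```

Any monotone-increasing family would do equally well here; the specific shape isn't the point.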
Nice! I expect I’ll be citing/referencing this at some point in the future. At some point I intend to have my own model of all this, and you’ve provided a lot of material to draw from.