Question for anyone, but particularly interested in hearing from Christiano, Shah, or Ngo: any thoughts on what happens when alignment schemes that worked in lower-capability regimes fail to generalize to higher-capability regimes?
For example, you could imagine a spectrum of outcomes from “no generalization” (illustrative example: galaxies tiled with paperclips) to “some generalization” (illustrative example: galaxies tiled with “hedonium” human-ish happiness-brainware) to “enough generalization that existing humans recognizably survive, but something still went wrong from our current perspective” (illustrative examples: “Failed Utopia #4-2”, Friendship Is Optimal, “With Folded Hands”). Given that not every biological civilization solves the problem, what does the rest of the multiverse look like? (How is measure distributed on something like my example spectrum, or whatever I should have typed instead?)
(Previous work: Yudkowsky 2009 “Value Is Fragile”, Christiano 2018 “When Is Unaligned AI Morally Valuable?”, Grace 2019 “But Exactly How Complex and Fragile?”.)
When alignment schemes fail to scale, I think it typically means that they work while the system is unable to overpower/outsmart the oversight process, and then break down once the system becomes able to do so. I think this usually results in the AI shifting from behavior that is mostly constrained by the training process to behavior that is mostly unconstrained (once it has effectively disempowered humans).
I think the results are relatively unlikely to be good in virtue of “the AI internalized something about our values, just not everything”, and I’m pretty skeptical of recognizable “near miss” scenarios; I find it more likely that the AI gradually careens in very hard-to-predict directions with minimal connection to the surface features of the training process.
Overall I think that the most likely outcome is a universe that is orthogonal to anything we directly care about, maybe with a vaguely similar flavor owing to convergence, depending on how AI motivations shake out. (But likely not close enough to feel great, and quite plausibly with almost no visible relation. Probably much more different from us than we are from aliens.)
I think it’s fairly plausible that the results are OK just because of galaxy-brained considerations about cooperation and niceness, where we might have been in the AI’s shoes and part of being a good cosmic citizen is not worrying too much about who gets to do what they want with the universe. That said, I’d put that at <50% chance, with uncertainty coming both from empirical questions (how the AI is trained and what kind of psychology that leads to) and from very hard problems in moral philosophy.
It’s also fairly plausible to me (maybe also ~50%) that such systems will care enough about humans to give them a tiny slice (e.g. 1e-30) of the universe, whether as part of complicated trade with other civilizations who didn’t mess up alignment so much (or nicer AIs, or whatever other nice civilization is willing to spare 1e-30 of their endowment to help us out), or just because it doesn’t take much caring at all to spare that much. But of course it’s also plausible that destructive conflict between aggressive civilizations leads to horrifying outcomes for us, so it’s not all roses (and even when everything is good, those trades do come at real costs).
Don’t have strong views on any of those questions; they seem important but not closely related to my day job so I haven’t thought about them too much.
But of course it’s also plausible that destructive conflict between aggressive civilizations leads to horrifying outcomes for us
Also, wouldn’t you expect s-risks from this to be very unlikely, by virtue of (1) civilizations like this being very unlikely to have substantial measure over the universe’s resources, (2) transparency making bargaining far easier, and (3) few technologically advanced civilizations caring about humans suffering in particular, as opposed to e.g. an adversary running emulations of their own species?
Basically agree with Paul, and I especially want to note that I’ve barely thought about this, so these numbers would likely change a ton with more information. To put down some numbers of my own:
“No generalization”: 65%
“Some generalization”: 5% (I don’t actually have stories where this is an outcome; this is more like model uncertainty)
“Lots of generalization, but something went wrong”: 30%
These are based on my own sense of what these categories mean, which I expect is pretty different from yours. For example, maybe I’m at ~2% that upon reflection I’d decide that hedonium is great and so that’s actually perfect generalization. And in the last category I include lots of worlds that I wouldn’t describe as “existing humans recognizably survive”: e.g. we decide to become digital uploads, get lots of cognitive enhancements, and throw away a bunch of evolutionary baggage, but we never expand to the stars because AI has taken control of the rest of the universe and given us only Earth.
I think the biggest avenues for improving the answers would be to reflect more on the kindness + cooperation and acausal trade stories Paul mentions, as well as the possibility that a few AIs end up generalizing close to correctly and working out a deal with other AIs that involves humanity getting, say, Earth.
Given that not every biological civilization solves the problem, what does the rest of the multiverse look like?
If we’re imagining civilizations very similar to humanity, then the multiverse puts ~100% of its measure on one of the options. Reality’s true answer will be very overdetermined; it is a failure of our map that we cannot determine that answer. I don’t know much about quantum physics / many-worlds, but I’d be pretty surprised if small fluctuations in our world made a huge difference; you’d need a lot of fluctuations adding up to a lot of improbability before you affect a macro-level property like this, unless you just happen to already be on the knife-edge.
If the biological civilizations could be very different from ours, then I have no idea how to quickly reason about this question and don’t have an answer, sorry.
This doesn’t contradict anything you’re saying, but there’s arguably a wager for thinking that we’re on the knife-edge: our actions are more impactful if we are.
[Edit to add point:] The degree to which any particular training approach generalizes is of course likely a fixed fact (like in the LessWrong post you link to about fire). But different civilizations could try different training approaches, which produces heterogeneity across the multiverse.
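One minimal way to make the structure of that wager explicit (a sketch of my own, not something stated in the thread; $p$, $\Delta V$, and $c$ are illustrative symbols, under the assumption that our marginal effort only changes the macro-level outcome in knife-edge worlds):
$$\mathbb{E}[\text{value of acting}] \approx p\,\Delta V - c,$$
where $p$ is the probability that we are on the knife-edge (i.e. that our effort flips which outcome obtains), $\Delta V$ is the value gap between the outcomes, and $c$ is the cost of the effort. All of the upside comes from the knife-edge worlds, so when deciding whether to act it is roughly as if we condition on being on the knife-edge, even when $p$ is small.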