When alignment schemes fail to scale, I think it typically means that they work while the system is unable to overpower or outsmart the oversight process, and then break down once the system becomes able to do so. I think this usually results in the AI shifting from behavior that is mostly constrained by the training process to behavior that is mostly unconstrained (once it has effectively disempowered humans).
I think the results are relatively unlikely to be good in virtue of “the AI internalized something about our values, just not everything,” and I’m pretty skeptical of recognizable “near miss” scenarios; I expect instead an AI gradually careening in very hard-to-predict directions with minimal connection to the surface features of the training process.
Overall I think the most likely outcome is a universe that is orthogonal to anything we directly care about, maybe with a vaguely similar flavor owing to convergence, depending on how AI motivations shake out. (But likely not close enough to feel great, and quite plausibly with almost no visible relation. Probably much more different from us than we are from aliens.)
I think it’s fairly plausible that the results are OK just because of galaxy-brained considerations about cooperation and niceness: we might have been in the AI’s shoes, and part of being a good cosmic citizen is not worrying too much about who gets to do what they want with the universe. That said, I’d put that at <50% chance, with uncertainty both over empirical questions (how the AI is trained and what kind of psychology that leads to) and over very hard problems in moral philosophy.
It’s also fairly plausible to me (maybe also ~50%) that such systems will care enough about humans to give them a tiny slice (e.g. 1e-30) of the universe, whether as part of complicated trades with other civilizations that didn’t mess up alignment so badly (or nicer AIs, or whatever other nice civilization is willing to spare 1e-30 of its endowment to help us out), or just because you don’t have to care very much at all. But of course it’s also plausible that destructive conflict between aggressive civilizations leads to horrifying outcomes for us, so it’s not all roses (and even when everything goes well, those trades come at real costs).
I don’t have strong views on any of these questions; they seem important, but they aren’t closely related to my day job, so I haven’t thought about them much.
“But of course it’s also plausible that destructive conflict between aggressive civilizations leads to horrifying outcomes for us”
Also, wouldn’t you expect s-risks from this to be very unlikely, by virtue of (1) civilizations like this being very unlikely to have substantial measure over the universe’s resources, (2) transparency making bargaining far easier, and (3) few technologically advanced civilizations caring about human suffering in particular (as opposed to, e.g., an adversary running emulations of their own species)?