I’m torn because I mostly agree with Eliezer that things don’t look good, and most technical approaches don’t seem very promising.
But the attitude of unmitigated doominess seems counter-productive. And there are obviously things worth doing and working on, and people getting on with them.
It seems like Eliezer is implicitly focused on finding an “ultimate solution” to alignment that we can be highly confident solves the problem regardless of how things play out. But this is not where the expected utility is. The expected utility is mostly in buying time and increasing the probability of success in situations where we are not highly confident that we’ve solved the problem, but we get lucky.
Ideally we won’t end up rolling the dice on unprincipled alignment approaches. But we probably will. So let’s try and load the dice. But let’s also remember that that’s what we’re doing.
I guess actually the goal is just to get something aligned enough to do a pivotal act. I don’t see, though, why an approach that tries to maintain a sufficient level of alignment relative to current capabilities, as capabilities scale, couldn’t work for that.
Yudkowsky mentions this briefly in the middle of the dialogue:
I don’t know however if I should be explaining at this point why “manipulate humans” is convergent, why “conceal that you are manipulating humans” is convergent, why you have to train in safe regimes in order to get safety in dangerous regimes (because if you try to “train” at a sufficiently unsafe level, the output of the unaligned system deceives you into labeling it incorrectly and/or kills you before you can label the outputs), or why attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes (qualitatively new thought processes, things being way out of training distribution, and, the hardest part to explain, corrigibility being “anti-natural” in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior (“consistent utility function”) which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off).
Basically, there are reasons to expect that alignment techniques that work in smaller, safe regimes fail in larger, unsafe regimes. For example, an alignment technique that requires your system to demonstrate undesirable behavior while running could remain safe while your system is weak, but become dangerous once the system is powerful enough for that undesirable behavior to do real harm.

That being said, Ajeya’s “Case for Aligning Narrowly Superhuman models” does flesh out the case for trying to align existing systems (as capabilities scale).

If you know of a reference to, or feel like explaining in some detail, the arguments given (in parentheses) for this claim, I’d love to hear them!
I’m familiar with these claims, and (I believe) the principal supporting arguments that have been made publicly. I think I understand them reasonably well.
I don’t find them decisive. Some aren’t even particularly convincing. A few points:
- EY sets up a false dichotomy between “train in safe regimes” and “train in dangerous regimes”. In the approaches under discussion there is an ongoing effort (e.g. involving some form(s) of training) to align the system, and the proposal is to keep this effort ahead of advances in capability (in some sense).
- The first 2 claims for why corrigibility wouldn’t generalize seem to prove too much—why would intelligence generalize to “qualitatively new thought processes, things being way out of training distribution”, but corrigibility would not?
- I think the last claim—that corrigibility is “anti-natural”—is more compelling. However, we don’t actually understand the space of possible utility functions and agent designs well enough for it to be that strong. We know that any behavior is compatible with a utility function, so I would interpret Eliezer’s claim as relating to the complexity (description length) of utility functions that encode corrigible behavior. Work on incentives suggests that removing manipulation incentives might add very little complexity to the description of the utility function, for an AI system that already understands the world well (a rough sketch follows below). Humans also seem to find it simple enough to add the “without manipulation” qualifier to an objective.
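As a rough illustration of that last point (my own notation, and only a gesture at the kind of construction the incentives work considers, not a claim about any specific proposal): if the system’s world model already contains a variable for the human’s judgement, the “without manipulation” version of an objective can be written as a small modification of the original one.

```latex
% Illustrative sketch only -- the symbols U, A, W, H are mine, not from the dialogue.
% U(W, H): the original utility, depending on the rest of the world W and the human's judgement H.
% A: the agent's action.  H_default: the judgement the human would have formed had the agent
% taken some fixed default action, i.e. the "unmanipulated" judgement.
\[
  U'(a) \;=\; \mathbb{E}\big[\, U(W,\, H_{\mathrm{default}}) \,\big|\, \mathrm{do}(A = a) \,\big]
\]
% Optimizing U' gives no incentive to influence H, since H is scored at its counterfactual
% default value whatever action is chosen.  The extra description length over U is roughly
% "evaluate H at its no-interference counterfactual" -- small for a system that already
% represents H and can evaluate such counterfactuals, even though it does a lot of work
% semantically.
```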
why would intelligence generalize to “qualitatively new thought processes, things being way out of training distribution”, but corrigibility would not?
This sounds confused to me: the intelligence is the “qualitatively new thought processes”. The thought processes aren’t some new regime that intelligence has to generalize to. Also to answer the question directly, I think the claim is that intelligence (which I’d say is synonymous for these purposes with capability) is simpler and more natural than corrigibility (i.e., the last claim—I don’t think these three claims are to be taken separately).
We know that any behavior is compatible with a utility function
People keep saying this, but it seems false to me. I’ve seen the construction for history-based utility functions that’s supposed to show this, and don’t find it compelling—it seems not to be engaging with what EY is getting at with “coherent planning behavior”. Is there a construction for (environment)-state-based utility functions? I’m not saying that is exactly the right formalism to demonstrate the relationship between coherent behavior and utility functions, but it seems to me closer to the spirit of what EY is getting at. (This comment thread on the topic seems pretty unresolved to me.)
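To be concrete, the construction I have in mind is roughly the following (my own paraphrase and notation, and possibly not the strongest version of the argument):

```latex
% Paraphrase of the usual history-based construction (notation mine).
% Fix any policy \pi mapping partial histories to actions, and define, over complete
% action-observation histories h = (a_1, o_1, ..., a_T, o_T):
\[
  U_\pi(h) \;=\;
  \begin{cases}
    1 & \text{if } a_t = \pi(a_1, o_1, \dots, a_{t-1}, o_{t-1}) \text{ for all } t \le T, \\
    0 & \text{otherwise.}
  \end{cases}
\]
% Following \pi achieves expected utility 1 in any environment, and no policy can do better,
% so the behavior \pi is trivially "optimal for a utility function" -- but only for one defined
% over the agent's own action history, which is why it feels like it dodges, rather than answers,
% the "coherent planning behavior" point about utility over environment states.
```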
I’m curious why this comment has such low karma, and −1 Alignment Forum karma.
If you think doom is very likely when AI reaches a certain level, then efforts to buy us time before then have the highest expected utility. The best way to buy time, arguably, is to study the different AI approaches that exist today and figure out which ones are the most likely to lead to dangerous AI. Then create regulations (either through government or at the corporation level) banning the types of AI systems that are proving to be very hard to align. (For example, we may want to ban expected reward/utility maximizers completely—satisficers should be able to do everything we want. Also, we may decide there’s really no need for AI to be able to self-modify and ban that too.) Of course a ban can’t be applied universally, so existentially dangerous types of AI will get developed somewhere somehow, and there are likely to be existentially dangerous types of AI we won’t have thought of that will still get developed. But at least we’ll be able to buy some time to do more alignment research that hopefully will help when that existentially dangerous AI is unleashed.
(addendum: what I’m basically saying is that prosaic research can help us slow down take-off speed, which is generally considered a good thing).