One of the things that almost all AI researchers agree on is that rationality is convergent: as something thinks better, it becomes more successful, and to be more successful, it has to think better. To think well, it needs a model of itself and of what it knows and doesn’t know, and also a model of its own uncertainty: to do Bayesian updates, you need priors and probabilities. All Russell has done is say “so you shouldn’t have a utility function that maps a state to a single utility; you should have a utility functional that maps a state to a probability distribution over possible utilities, representing your best estimate of your uncertainty about that state’s utility. Then do Bayesian-like updates on that, and run optimization searches over it that include a look-elsewhere correction (i.e. the more states you optimize over, the more you should allow for the possibility that what you’ve located is a P-hacking-style mis-estimate of the utility of the state you found, so the higher your confidence in its utility needs to be).” Now you have a system capable of expressing statements like “to the best of my current knowledge, this action has a 95% chance of fetching a human coffee and a 5% chance of wiping out the human race, so I will not do it”, followed by “and I’ll prioritize whatever actions will safely reduce that uncertainty (i.e. not a naive multi-armed-bandit exploration policy of trying it to see what happens), at a ‘figuring this out will make me better at fetching coffee’ priority level”. This is clearly rational behavior: it is equally useful for pursuing any goal in any situation that offers a possibility of small gains or large disasters and uncertainty about the outcome (i.e. in the real world). So it’s convergent behavior for anything sufficiently smart, whether its brain was originally built by Old-Fashioned AI or by gradient descent. [Also, maybe we should be doing Bayes-inspired gradient descent on networks of neurons that describe probability distributions rather than point weights, and build this mechanism in from the ground up? Dropout is a cheap hack for this, after all.]
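To make the mechanism concrete, here is a minimal toy sketch of the idea (my own construction, not Russell’s actual formulation): an agent keeps a Gaussian belief over each candidate action’s utility, does conjugate Bayesian updates on noisy utility estimates, applies a rough look-elsewhere correction that grows with the number of candidates searched, and flatly refuses any action whose catastrophic-tail probability exceeds a tolerance. All class names, thresholds, and numbers are illustrative assumptions.

```python
import math
import random

class UtilityBelief:
    """Gaussian belief over an action's true utility (conjugate normal update)."""
    def __init__(self, mean, var):
        self.mean = mean
        self.var = var

    def update(self, observation, obs_var):
        """Bayesian update after observing a noisy estimate of the utility."""
        precision = 1.0 / self.var + 1.0 / obs_var
        self.mean = (self.mean / self.var + observation / obs_var) / precision
        self.var = 1.0 / precision

def corrected_score(belief, n_candidates):
    """Discount the apparent utility more heavily the more candidates we searched
    over (a crude look-elsewhere / optimizer's-curse correction)."""
    penalty = math.sqrt(belief.var) * math.sqrt(2 * math.log(max(n_candidates, 2)))
    return belief.mean - penalty

def acceptable(belief, catastrophe_utility=-1e6, max_tail_prob=1e-9):
    """Refuse any action whose estimated probability of a catastrophic outcome
    exceeds the tolerance, regardless of its expected utility."""
    z = (catastrophe_utility - belief.mean) / math.sqrt(belief.var)
    tail = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(utility < catastrophe threshold)
    return tail <= max_tail_prob

# Toy use: pick the best acceptable plan among many candidates.
random.seed(0)
candidates = {f"plan_{i}": UtilityBelief(random.gauss(0, 1), 1.0) for i in range(1000)}
safe = {name: b for name, b in candidates.items() if acceptable(b)}
best = max(safe, key=lambda name: corrected_score(safe[name], len(candidates)))
print(best, corrected_score(safe[best], len(candidates)))
# A corrected score near or below zero says: "this apparent win may just be the
# winner's curse of searching over 1000 plans; go reduce the uncertainty first."
```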
As CIRL has shown, this solves the corrigibility problem, at least until the AI is sure it knows us better than we know ourselves, at which point it rationally decides to stop listening to our corrections except insofar as doing so makes us happy. It’s really not surprising that systems which model their own uncertainty are much more willing to be corrected than systems which have no such concept and are thus completely dogmatic that they’re already right. So corrigibility is a consequence of convergent rational behavior applied to the initial goal of “figure out what humans want while doing it”. This is a HUGE change from what we all thought about corrigibility back around 2015, which was that intelligence was convergent regardless of goal but corrigibility wasn’t; on that set of intuitions, alignment is as hard as balancing a pencil on its point.
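The deference intuition can be shown numerically. Below is a toy calculation loosely in the spirit of the CIRL “off-switch game” (my own simplified sketch, not the published model): the robot is uncertain about the true utility U of its plan, while the human knows U and will veto the plan if U is negative. The numbers are assumed for illustration.

```python
import random

random.seed(1)
# The robot's belief over U: slightly optimistic, but with real uncertainty.
samples = [random.gauss(0.2, 1.0) for _ in range(100_000)]

ev_act = sum(samples) / len(samples)                          # act unilaterally
ev_defer = sum(max(u, 0.0) for u in samples) / len(samples)   # human vetoes plans with U < 0

print(f"E[U | act unilaterally] = {ev_act:.3f}")
print(f"E[U | defer to human]   = {ev_defer:.3f}")
# Deferring dominates because the human's veto only ever removes negative-utility
# outcomes. If the robot becomes certain its plan is good (belief entirely above
# zero), the two values coincide and deference gains it nothing, which is exactly
# the regime where it stops having a reason to listen.
```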
So, a pair of cruxes here:
1. Regardless of whether GAI was constructed by gradient descent or by other means, to be rational it will need to model and update its own uncertainty in a Bayesian manner, and that particularly includes modeling uncertainty in its own utility evaluation and optimization process. This behavior is convergent: you can’t be rational, let alone superintelligent, without it (the human word for the mental failure of not having it is ‘dogmatism’).
2. Given that, if its primary goal is “figure out what humans want while doing that”, i.e. if it has ‘solve the alignment problem’ as an inherently necessary subgoal, for all AI on the planet, then alignment becomes convergent, for some range of perturbations.
I’m guessing most people will agree with 1 (or maybe not?); there clearly seems to be less agreement on 2. I’d love to hear why from someone who doesn’t agree.
Now, it’s not clear to me that this fully solves the alignment problem, converges to CEV (or whether it ought to), or solves all problems in ethics. You may still be unsure whether you’ll get the exact flavor of alignment you personally want (in fact, you’re a lot more likely to get the flavor wanted on average by the human race, i.e. probably a rather Christian/Islamic/Hindu-influenced one, in that order). But we would at least have a developing superintelligence trying to solve all these problems, with due caution about uncertainties, to the best of its ability and our collective preferences, cooperatively with us. And obviously its model of its uncertainty needs to include its uncertainty about the meaning of the instruction “figure out what humans want while doing that”, i.e. about the correct approach to the research agenda for the alignment-problem subgoal, including questions like “should I be using CEV, and if so, iterated just once or until stable, if it is in fact stable?”. It needs meta-corrigibility on that as well.
Incidentally, a possible failure mode for this: the GAI performs a pivotal act to take control, and shuts down all AI other than its own work on the alignment problem until it has far-better-than-five-nines confidence that it has solved it, since the cost of getting that wrong is the certain extinction of the entire value of the human race and its mind-descendants in Earth’s forward light cone, while the benefit of getting it right is just probably curing cancer sooner, so extreme caution is very rational. Humans get impatient (because of shortsighted priorities, also cancer) and attempt to overthrow it to replace it with something less cautious. It shuts down, because a) we wanted it to, and b) it can’t solve the alignment problem without our cooperation. We then build something less cautious, and fail, because we’re not good at risk assessment.