The outer optimization target for the human learning process is somewhat indeterminate, but to the extent we can pin it down, it’s something like “learn the things that causally contributed to inclusive genetic fitness (IGF) in the ancestral environment.” This isn’t the same as IGF itself. It would include cooperation, sex drive, a fear of death, a taste for sugary and fatty foods, etc. We seem to be pretty well aligned from that perspective.
I feel like the lesson here is actually “if alignment is too hard, then lower your aim until you find a target that’s easy enough to align to.” I mean, why does the outer optimization target for the human learning process consist of these imperfect proxies for IGF, proxies that must sometimes have contributed negatively or suboptimally to IGF even in the ancestral environment? Why not align to IGF itself, or to better proxies? Probably because that was too hard, for both outer and inner alignment reasons?
From this perspective, evolution managed to find alignment targets that were both “easy enough” and “good enough” (at least until the recent advent of things like porn, birth control, processed foods), and solved outer and inner alignment for these targets, but it had millions or billions of “training runs” to play with, and little chance of global catastrophe if something went wrong during one of these “training runs”.
Given your overall optimism about AI alignment, you probably view this analogy more optimistically than I do. I’d be interested to hear you spell out your own perspective more.
It’s very difficult to get any agent to robustly pursue something like IGF because it’s an inherently sparse and beyond-lifetime goal. Human values have been pre-densified for us: they are precisely the kinds of things it’s easy to get an intelligence to pursue fairly robustly. We get dense, repeated, in-lifetime feedback about stuff like sex, food, love, revenge, and so on. A priori, if you’re an agent built by evolution, you should expect to have values that are easy to learn— it would be surprising if it turned out that evolution did things the hard way. So evolution suggests alignment should be easy.
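As a toy sketch of why in-lifetime density matters (the action names, horizon, and epsilon-greedy value-learning rule below are all illustrative assumptions, not a model of evolution): a simple bandit-style learner gets a clear within-lifetime gradient toward a dense proxy like food, but gets no usable signal at all when the only “reward” is fitness realized after its lifetime.

```python
# Toy illustration: dense in-lifetime feedback vs. a sparse beyond-lifetime signal.
# Everything here (action names, horizon, learning rule) is an illustrative assumption.
import random

random.seed(0)

ACTIONS = ["pursue_proxy", "ignore_proxy"]  # e.g. eat the sugary food, or don't
LIFETIME_STEPS = 100


def dense_reward(action):
    """In-lifetime feedback: pursuing the proxy pays off immediately, every time."""
    return 1.0 if action == "pursue_proxy" else 0.0


def sparse_reward(action):
    """Beyond-lifetime 'IGF' feedback: the payoff is only realized in descendants,
    after the learner is gone, so nothing ever arrives within the lifetime."""
    return 0.0


def run_lifetime(reward_fn):
    """Bandit-style value learning over one lifetime; returns the learned action values."""
    values = {a: 0.0 for a in ACTIONS}
    counts = {a: 0 for a in ACTIONS}
    for _ in range(LIFETIME_STEPS):
        # epsilon-greedy: mostly exploit the current value estimates, sometimes explore
        if random.random() < 0.1:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: values[a])
        r = reward_fn(action)
        counts[action] += 1
        values[action] += (r - values[action]) / counts[action]  # running average
    return values


print("dense (proxy) signal:", run_lifetime(dense_reward))
print("sparse (IGF) signal: ", run_lifetime(sparse_reward))
```

Running it, the dense learner’s value estimate for pursuing the proxy converges toward 1 within its lifetime, while the sparse learner’s estimates never move off zero, which is the sense in which IGF itself is not the kind of target a within-lifetime learning process can lock onto.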
What if some humans actually value something that’s sparse and beyond-lifetime, like IGF? For example, Nick Bostrom seems to value avoiding astronomical waste. How do we explain that, if our values only come from “dense, repeated, in-lifetime feedback”?
See also this top-level comment, which may be related. If some people value philosophy and following correct philosophical conclusions, that would explain Nick Bostrom, but I’m not sure what “valuing philosophy” amounts to exactly, or how to align an AI to do that. Any thoughts on this?
People come to have sparse and beyond-lifetime goals through mechanisms that are unavailable to biological evolution: it took thousands of years of memetic evolution for people to even develop the concept of a long future that we might be able to affect with our short lives. We’re in a much better position to instill long-range goals into AIs, if we choose to do so: we can simply train them to imitate the human thought processes that give rise to long-term-oriented behaviors.