I’m confused. What is the outer optimization target for human learning?
My two top guesses below.
To me it looks like human values are the result of humans learning from their environment (which was shaped by earlier humans and includes current humans). So human values are, almost by definition, whatever humans learned. In that case, observing that humans learned human values doesn’t tell us anything.
Or maybe you mean something like parents / society / … teaching new humans their values? I see some other problems there:
I’m not sure what the success rate is, but values seem to be changing noticeably.
There was a lot of time to test multiple methods of teaching new humans values, with humans not changing that much.
The outer optimization target for the human learning process is kind of indeterminate, but to the extent we can determine it, it’s something like “learn the things that causally contributed to IGF in the ancestral environment.” This isn’t the same as IGF itself. It would include cooperation, sex drive, a fear of death, a taste for sugary and fatty foods, etc. We seem to be pretty well aligned from that perspective.
Also, if you view evolution from a wider perspective, we’re not that misaligned, since it’s just trying to find sticky patterns that reproduce themselves a lot, and it seems likely that human civilization will conquer the lightcone in some form or another fairly soon (even if it’s misaligned AI doing it).
The outer optimization target for the human learning process is kind of indeterminate, but to the extent we can determine it, it’s something like “learn the things that causally contributed to IGF in the ancestral environment.” This isn’t the same as IGF itself. It would include cooperation, sex drive, a fear of death, a taste for sugary and fatty foods, etc. We seem to be pretty well aligned from that perspective.
I feel like the lesson here is actually “if alignment is too hard, then lower your aim until you find a target that’s easy enough to align to.” I mean, why is the outer optimization target for the human learning process these imperfect proxies for IGF that must have sometimes contributed negatively or suboptimally to IGF even in the ancestral environment? Why not align to IGF itself or better proxies? Probably because that was too hard for both outer and inner alignment reasons?
From this perspective, evolution managed to find alignment targets that were both “easy enough” and “good enough” (at least until the recent advent of things like porn, birth control, processed foods), and solved outer and inner alignment for these targets, but it had millions or billions of “training runs” to play with, and little chance of global catastrophe if something went wrong during one of these “training runs”.
Given your overall optimism about AI alignment, you probably view this analogy more optimistically than I do. Would be interested to hear you spell out your own perspective more.
It’s very difficult to get any agent to robustly pursue something like IGF because it’s an inherently sparse and beyond-lifetime goal. Human values have been pre-densified for us: they are precisely the kinds of things it’s easy to get an intelligence to pursue fairly robustly. We get dense, repeated, in-lifetime feedback about stuff like sex, food, love, revenge, and so on. A priori, if you’re an agent built by evolution, you should expect to have values that are easy to learn— it would be surprising if it turned out that evolution did things the hard way. So evolution suggests alignment should be easy.
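To make the dense-versus-sparse point concrete, here is a toy sketch. It is not a model of evolution or of any real training setup. The assumptions are all mine: a "lifetime" is a fixed sequence of choices, there is one hidden correct action per step, and the learner simply tries options in order. The names dense_trials and sparse_trials are hypothetical. With per-step feedback each choice can be solved on its own; with a single end-of-lifetime signal only whole sequences can be scored.

```python
from itertools import product


def dense_trials(target, n_actions):
    """Count trials when feedback arrives right after every single action."""
    trials = 0
    for correct in target:
        # Each position is solved independently of the others.
        for action in range(n_actions):
            trials += 1
            if action == correct:
                break
    return trials


def sparse_trials(target, n_actions):
    """Count trials when the only signal is 'was the whole lifetime right?'."""
    candidates = product(range(n_actions), repeat=len(target))
    for trials, candidate in enumerate(candidates, start=1):
        # No per-step credit assignment: a sequence is judged only as a whole.
        if candidate == tuple(target):
            return trials
    raise ValueError("target uses an action outside range(n_actions)")


target = [2, 1, 3, 0]  # hypothetical hidden "correct" choices
print(dense_trials(target, 4), sparse_trials(target, 4))  # prints: 10 157
```

In this toy the dense learner needs at most steps × actions trials (16 here), while the sparse learner can need actions ** steps (256 here), and that gap grows exponentially with the length of the lifetime.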
What if some humans actually value something that’s sparse and beyond-lifetime like IGF? For example, Nick Bostrom seems to value avoiding astronomical waste. How to explain that, if our values only come from “dense, repeated, in-lifetime feedback”?
See also this top-level comment which may be related. If some people value philosophy and following correct philosophical conclusions, that would explain Nick Bostrom, but I’m not sure what “valuing philosophy” is about exactly, or how to align AI to do that. Any thoughts on this?
People come to have sparse and beyond-lifetime goals through mechanisms that are unavailable to biological evolution— it took thousands of years of memetic evolution for people to even develop the concept of a long future that we might be able to affect with our short lives. We’re in a much better position to instill long-range goals into AIs, if we choose to do so— we can simply train them to imitate human thought processes which give rise to longterm-oriented behaviors.
Trying to find patterns that reproduce themselves a lot with minimal change in the patterns (but still some change) seems like a better model of evolution to me. By that metric, if we solve AI alignment with us, I think we’ll end up mostly solving our alignment with DNA’s values: much of what DNA valued has been lost, but those who care about the environment for its own sake and beauty will represent a high enough capability group to construct the repair process, if given the chance to do so by an AI that respects their values, anyway.