Seems like you’re expecting the AI system to be inner aligned? I’m assuming it will have some distribution over mesa objectives (or values, as I call them), and that implies uncertainty over how to weigh them and how they apply to new domains.
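To illustrate the kind of uncertainty I mean, here's a minimal sketch, assuming the system's values can be treated as a weighted mixture of candidate objectives. Everything in it (the value functions, the credences, the "transfer confidence" numbers) is made up for illustration:

```python
from typing import Callable, Dict

# Hypothetical candidate mesa objectives ("values") the system may have learned.
# Each maps an outcome description to a scalar score.
def value_keep_promises(outcome: dict) -> float:
    return 1.0 if outcome.get("promise_kept", False) else -1.0

def value_reduce_harm(outcome: dict) -> float:
    return -float(outcome.get("harm", 0.0))

# The agent is uncertain how much weight each learned value deserves,
# and how well each value transfers to the new domain it finds itself in.
credences: Dict[str, float] = {"keep_promises": 0.6, "reduce_harm": 0.4}
transfer_confidence: Dict[str, float] = {"keep_promises": 0.9, "reduce_harm": 0.5}
values: Dict[str, Callable[[dict], float]] = {
    "keep_promises": value_keep_promises,
    "reduce_harm": value_reduce_harm,
}

def score(outcome: dict) -> float:
    """Aggregate score under uncertainty over which values apply and how much."""
    return sum(
        credences[name] * transfer_confidence[name] * fn(outcome)
        for name, fn in values.items()
    )

print(score({"promise_kept": True, "harm": 2.0}))
# 0.6*0.9*1.0 + 0.4*0.5*(-2.0) ≈ 0.14
```

The point is just that, on this picture, there is no single objective to consult: how much weight each value gets, and how far it transfers to a new domain, are themselves uncertain quantities.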
Moral reasoning is natural to us in the way vision and movement are natural to us, so it’s easy to underestimate how much care evolution had to take to get us to do it.
Why are you so confident that evolution played much of a role at all? How did a tendency to engage in a particular style of moral philosophy cognition help in the ancestral environment? Why would that style, in particular, be so beneficial that evolution would “care” so much about it?
My position: mesa objectives learned in domain X do not automatically or easily generalize to a sufficiently distinct domain Y. The style of cognition required to make such generalizations is startlingly close to that which we call “moral philosophy”.
Human social instincts are pretty important, including instincts for following norms and also for pushing back against norms. Not just instincts for specific norms, but also one-level-up instincts for norms in general. These form the basis of what I’m pointing at with the label “moral reasoning.”
I think I do expect AIs to be more inner-aligned than many others expect (because of the advantages gradient descent has over genetic algorithms). But even if we suppose we get an AI governed by a mishmash of interdependent processes that sometimes approximate mesa-optimizers, I still don’t expect what you expect: I don’t think early AGI would even have the standards by which it could say its values “fail” to generalize; it would just follow what would seem to us like a bad generalization.
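A cartoon of what I mean, loosely modeled on the familiar “goal at the end of the level” misgeneralization setups. All of the names and numbers here are hypothetical:

```python
# Toy "mesa objective" learned in domain X: in every training level the goal
# tile sits at the far right, so "be as far right as possible" scores exactly
# as well as "reach the goal" during training.
def learned_proxy(position: int, level_width: int) -> float:
    return position / level_width

def intended_value(position: int, goal_position: int) -> float:
    return 1.0 if position == goal_position else 0.0

# Domain X: goal at the right edge, so the proxy and the intended value agree.
print(learned_proxy(9, 10), intended_value(9, goal_position=9))  # 0.9, 1.0

# Domain Y: goal moved to the left. The proxy raises no "failure to generalize";
# it keeps recommending "go right", which we would call a bad generalization.
print(learned_proxy(9, 10), intended_value(9, goal_position=1))  # 0.9, 0.0
```

Nothing inside the system flags domain Y as a generalization failure; the learned objective still produces confident recommendations, they just aren’t the ones we wanted.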
By “moral philosophy”, I’m trying to point to a specific subset of values-related cognition, one much smaller than the totality of values-related cognition: the subset that pertains to generalizing existing values to new circumstances. I claim that there exists a simple “core” of how this generalization ought to work for a wide variety of values-holding agentic systems, and that this core is startlingly close to how it works in humans.
It’s of course entirely possible that humans implement a modified version of this core process. However, it’s not clear to me that we want an AI to exactly replicate the human implementation. E.g., do you really want to hard-wire an instinct for challenging the norms you try to impose?
Also, I think there are actually two inner misalignments that occurred in humans.
1: Between inclusive genetic fitness as the base objective, evolution as the learning process, and the human reward circuitry as the mesa objectives.
2: Between activation of human reward circuitry as the base objective, human learning as the learning process, and human values as the mesa objectives.
I think AIs will probably be, by default, a bit less misaligned to their reward functions than humans are misaligned to their reward circuitry.