Human social instincts are pretty important, including instincts for following norms and also for pushing back against norms. Not just instincts for specific norms, but also one-level-up instincts about norms in general. These form the basis of what I'm pointing at when I use the label “moral reasoning.”
I think I do expect AIs to be more inner-aligned than many others do (because of the advantages gradient descent has over genetic algorithms). But even if we suppose that we get an AI governed by a mishmash of interdependent processes that sometimes approximate mesa-optimizers, I still don’t expect what you expect: I don’t expect early AGI to even have the standards by which it would judge that its values “fail” to generalize; it would just follow what would look to us like a bad generalization.
By “moral philosophy”, I’m trying to point to a specific subset of values-related cognition that is much smaller than the totality of values-related cognition. Specifically, that subset of values-related cognition that pertains to the generalization of existing values to new circumstances. I claim that there exists a simple “core” of how this generalization ought to work for a wide variety of values-holding agentic systems, and that this core is startlingly close to how it works in humans.
It’s of course entirely possible that humans implement a modified version of this core process. However, it’s not clear to me that we want an AI to exactly replicate the human implementation. E.g., do you really want to hard-wire an instinct for challenging the norms you try to impose?
Also, I think there are actually two inner misalignments that occurred in humans.
1: Between inclusive genetic fitness as the base objective, evolution as the learning process, and the human reward circuitry as the mesa objectives.
2: Between activation of human reward circuitry as the base objective, human learning as the learning process, and human values as the mesa objectives.
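The base-objective/mesa-objective gap above can be sketched with a toy example (entirely hypothetical, not modeling any specific agent or claim here): a learner fits a proxy feature that correlates with the base objective during training, and then acts on the proxy when the correlation breaks.

```python
# Toy sketch of objective misgeneralization. The "base objective" rewards
# calories; the learned ("mesa") preference latches onto sweetness, which
# merely correlates with calories in training. All items and numbers are
# made up for illustration.

# Training items: (sweetness, calories). Here sweetness tracks calories.
train = [(0.9, 100), (0.8, 90), (0.1, 10), (0.2, 15)]

# The learned preference is just: prefer sweeter items (a proxy for calories).
def learned_preference(item):
    sweetness, _calories = item
    return sweetness

# Deployment: an artificial sweetener breaks the training correlation.
deploy = [(0.95, 0), (0.3, 120)]  # sweet-but-zero-calorie vs. bland-but-caloric

chosen = max(deploy, key=learned_preference)
# The agent picks the zero-calorie sweet item: the proxy generalized,
# the base objective (calories) did not.
```

The point of the toy is that the learner never represents "calories" at all, so there is no internal standard by which its values "fail" to generalize; it just keeps optimizing the proxy.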
I think AIs will probably be, by default, a bit less misaligned to their reward functions than humans are misaligned to their reward circuitry.