Oh, I think inner misalignment with respect to the reward circuitry is a good thing that we actually want, so that's the disconnect (misalignment is usually thought of as bad, and no, I'm not mistyping).
Yes, thank you: I didn’t notice that you were making that assumption. This conversation makes a lot more sense to me now.
Human values are formed through inner misalignment, and they have many great properties: they avoid ontological crises, they value real-world things (like the diamond maximizer in the OP), and a subset of them cares about all of humanity. We can learn more about this process by focusing more on the "a particular human's learning process + reward circuitry + 'training' environment" part, and less on the evolution part.
If we understand the underlying mechanisms of human value formation through inner misalignment with respect to the reward circuitry, then we might be able to develop a better theory of how learning systems form values, a theory that would also apply to AGI.
This seems to imply that the aim of this alignment proposal is to solve the alignment problem by aligning the AI's inner values with those of its creators, bypassing the outer alignment problem. That is really interesting; I've updated in the direction of shard theory being more viable as an alignment strategy than I previously believed. I'm still confused about huge parts of it, but we can discuss that more elsewhere.