Nate’s B) currently seems confused. I read it as connoting “we need the AGI’s learned concepts to be safe under extreme optimization pressure, such that, when extremized, they yield reasonable results (e.g. human faces from maximizing the AI-faceishness concept activation of an image).”
But I think agents will not maximize their own concept activations when choosing plans. An agent’s values optimize the world; the values don’t optimize themselves. For example, I think that I am not looking for a romantic relationship which maximally activates my “awesome relationship” concept, if that’s even a concept I have. It’s true that, conditional on such a plan being considered, my relationship-shard might bid for that plan with strength monotonically increasing in “predicted activation of awesome-relationship”.
And conditional on such an activation-maximizing plan being considered, I would therefore be very inclined to pursue it.
But I think it’s not true that my relationship-shard is optimizing its own future activations by extremizing future concept activations. I think that such a plan won’t get found, and the agent won’t want to find it. Values are not the optimization target. (This point is explained in more detail in Alignment allows “nonrobust” decision-influences and doesn’t require robust grading.)
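
To make the distinction concrete, here is a minimal toy sketch (my own illustration; the names `generate_candidate_plans` and `shard_bid` and the plan space are made up). It contrasts a shard that bids on whatever plans the planner happens to propose, with bid strength monotonically increasing in predicted concept activation, against a grader-optimization move that searches the whole plan space for the activation-maximizing plan.

```python
import random

# Hypothetical plan space (illustration only): each plan has a predicted
# "awesome relationship" concept activation in [0, 1]. Plans near 1.0 are the
# adversarial, activation-maximizing edge cases.
ALL_PLANS = {f"plan_{i}": i / 1000 for i in range(1001)}

def generate_candidate_plans(n=5):
    """The agent's plan generator: proposes a handful of ordinary plans.
    Nothing here steers the search toward extreme activations."""
    return random.sample(sorted(ALL_PLANS), n)

def shard_bid(plan):
    """Relationship-shard: bids with strength monotonically increasing in the
    plan's predicted concept activation -- but only on plans actually considered."""
    return ALL_PLANS[plan]

# Shard-style choice: pick the best plan among those considered.
candidates = generate_candidate_plans()
chosen = max(candidates, key=shard_bid)

# Grader-optimization-style choice: argmax activation over the whole plan
# space. This is the move the comment argues real agents don't make.
extremized = max(ALL_PLANS, key=ALL_PLANS.get)

print("considered:", candidates)
print("shard-chosen plan:", chosen)               # typically an ordinary plan
print("activation-maximizing plan:", extremized)  # plan_1000, never proposed
```

In the sketch, the extremized plan exists in the plan space, but because the planner never proposes it, the shard never bids on it: the shard shapes choices among considered plans rather than driving a global search over its own activations.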