Evolution failed at imparting its goal into humans, since humans have their own goals that they shoot for instead when given a chance.
To me, your framing of inner misalignment sounds like Goodharting itself because we evolved our intrinsic motivations towards these measures because they were good measures in the ancestral environment. But when we got access to advanced technology we kept optimizing on the measure (sex, sugar, beauty, etc) which led to it becoming no longer a measure of the actual target (kids, calories, health, etc.)
I think outer alignment is better thought of as a property of the objective function i.e. βan objective function is outer aligned if it incentivizes or produces the behavior we actually want on the training distribution.β
To me, your framing of inner misalignment sounds like Goodharting itself because we evolved our intrinsic motivations towards these measures because they were good measures in the ancestral environment. But when we got access to advanced technology we kept optimizing on the measure (sex, sugar, beauty, etc) which led to it becoming no longer a measure of the actual target (kids, calories, health, etc.)
I think outer alignment is better thought of as a property of the objective function i.e. βan objective function is outer aligned if it incentivizes or produces the behavior we actually want on the training distribution.β