The most important claim in your comment is that “human learning → human values” is evidence that solving / preventing inner misalignment is easier than it seems when one looks at it from the “evolution → human values” perspective.
That’s not a claim I made in my comment. It’s technically a claim I agree with, but not one I think is particularly important. Humans do seem better aligned to getting reward across distributional shifts than to achieving inclusive genetic fitness across distributional shifts. However, I’ll freely agree with you that humans are typically misaligned with maximizing the reward from their outer objectives.
I operationalize this as: “After a distributional shift from their learning environment, humans frequently behave in a manner that predictably fails to maximize reward in their new environment, specifically because they continue to implement values they’d acquired from their learning environment which are misaligned to reward maximization in the new environment”. Please let me know if you disagree with my operationalization.
For example, one way in which humans are inner misaligned is that, if you introduce a human into a new environment which has a button that will wirehead the human (thus maximizing reward in the new environment), but which has other consequences that are bad by the lights of the human’s preexisting values (e.g., Logan’s example of killing everyone else), most humans won’t push the button.
The actual claim I made in the comment you’re replying to is that there’s a predictable relationship between outer optimization criteria and inner values, not that inner values are always aligned with outer optimization criteria. In fact, I’d say we’d be in a pretty bad situation if inner goals reliably oriented towards reward maximization across all environments, because then any sufficiently powerful AGI would most likely wirehead once it was able to do so.