Richard_Ngo comments on Value systematization: how values become coherent (and misaligned)

Richard_Ngo 11 Jan 2024 18:10 UTC
In the standard story, what are the terminal goals? You could say “random” or “a mess”, but I think it’s pretty compelling to say “well, they’re probably related to the things that the agent was rewarded for during training”. And those things will likely include “curiosity, gaining access to more tools or stockpiling resources”.
I call these “convergent final goals” and talk more about them in this post.
I also think that an AGI might systematize other goals that aren’t convergent final goals, but these seem harder to reason about, and my central story is that the goals it systematizes are convergent final goals. (Note that this is somewhat true for humans as well: e.g. curiosity and empowerment/success are final-ish goals for many people.)