See: DeepMind’s “How undesired goals can arise with correct rewards” for an empirical example of inner misalignment.
From a quick skim, that post seems to only be arguing against scheming due to inner misalignment. Let me know if I’m wrong.