As a failure mode of specification gaming, agents might modify their own goals. As a convergent instrumental goal, agents want to prevent their goals to be modified.I think I know how to resolve this apparent contradiction, but I’d like to see other people’s opinions about it.
As a failure mode of specification gaming, agents might modify their own goals.
As a convergent instrumental goal, agents want to prevent their goals to be modified.
I think I know how to resolve this apparent contradiction, but I’d like to see other people’s opinions about it.