Consider a competent policy that wants paperclips in the very long run. It could reason “I should get a low loss to get paperclips,” and then get a low loss. As a result, it could be selected by gradient descent.
Consider a competent policy that wants paperclips in the very long run. It could reason “I should get a low loss to get paperclips,” and then get a low loss. As a result, it could be selected by gradient descent.