The learner fairly quickly develops a decent representation of the actual goal, a world model, etc., and pursues this goal.
Wouldn’t you expect decision-making, and therefore goals, to sit in the final layers of the model? If so, those layers compute the goal from high-level world-model features. If the world model improves, those high-level abstractions improve too. The goal circuitry doesn’t have to build a model of its target from scratch, because it is connected to the world model.
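As a minimal sketch of the architecture I have in mind (a hypothetical PyTorch toy; none of the layer names or sizes come from the post), the goal head sits on top of the world-model trunk, so any improvement in the trunk’s high-level features flows into the goal computation for free:

```python
import torch
import torch.nn as nn

class Agent(nn.Module):
    def __init__(self, obs_dim=64, hidden=128, n_actions=8):
        super().__init__()
        # Early/middle layers: the "world model" that builds high-level abstractions.
        self.world_model = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Final layers: the goal / decision-making head that reads those abstractions.
        self.goal_head = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        features = self.world_model(obs)   # improves whenever the world model improves
        return self.goal_head(features)    # goal/decision computed *from* world-model features

logits = Agent()(torch.randn(1, 64))  # toy usage
```

The point of this wiring is just that the goal head never has to rebuild the trunk’s abstractions itself; its inputs get better whenever the world model does.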
When the learner has a really excellent world model that can make long-range predictions and so forth (good enough that it can reason itself into playing the training game for a wide class of long-term goals), then we get a large class of objectives that achieve losses just as good as, if not better than, the initial representation of the goal.
Even if the model is sophisticated enough to make long-range predictions, it still has to care about the long run in order to have an incentive to play the training game. Long-term goals are addressed extensively in this post and the next.
When this happens, gradients derived from regularisation and/or loss may push the learner’s objective towards one of these problematic alternatives.
Suppose we have a model with a sufficiently aligned goal A. Denote an unaligned goal by U, and instrumental training-reward optimization by S. It sounds like your idea is that S gets better training performance than directly pursuing A, so the model should switch its goal to U in order to play the training game and get that better performance. But if S gets better training performance than directly pursuing A, then the model doesn’t need to switch its goal to play the training game; playing it is already instrumentally valuable for A. Why would it switch?
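To make the bookkeeping concrete (the loss notation $\mathcal{L}$ below is mine, not anything from the post):

$$
\mathcal{L}(A,\ \text{play the training game}) \;=\; \mathcal{L}(U,\ \text{play the training game}) \;<\; \mathcal{L}(A,\ \text{pursue } A \text{ directly})
$$

An A-model that plays the training game already sits at the same training loss as a U-model that plays it, so there is no gradient pressure favouring a switch from A to U.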
Also, because the initial “pretty good” goal is not a long-range one (it developed when the world model was not so good), it doesn’t necessarily steer the learner away from possibilities like this.
Wouldn’t the initial goal continue to be updated over time? Why would training build a second goal instead of improving the original one?