Say we’re at the stage where the model only has a bunch of heuristics as its utility function. Then SGD comes along and modifies the model to replace those heuristics with a pointer somewhere into the world model, say to place A. Here, A should look kind of like the base objective.
Now the model gets smarter, updates its world model, and comes up with a better representation of the base objective somewhere else in the world model, say at place B. It could now choose to change the pointer to point to B instead of A. But (and this is the point the quoted part made) it has no reason to do this. Its current utility function is the thing at place A; the thing at place B is different, and judged by the utility at A, re-pointing to B is not an improvement.
So the pointer isn’t going to change by itself when the model processes more input. Though it could always change when SGD modifies the model.
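To make the mechanism concrete, here is a minimal toy sketch of that dynamic, under a deliberately simplified setup (the names `Agent`, `world_model`, `utility_pointer`, and `sgd_update` are hypothetical illustrations, not anything from the original discussion): the agent evaluates whatever its pointer currently targets, processing more input only improves the world model without moving the pointer, and only an external SGD-style edit changes what the utility points at.

```python
# Toy sketch: utility-as-pointer into a world model.
# All names here are illustrative assumptions, not a real training setup.

class Agent:
    def __init__(self):
        # World model: a map from "places" to representations of concepts.
        self.world_model = {"A": "rough proxy of the base objective"}
        # The utility function is just a pointer into the world model.
        self.utility_pointer = "A"

    def utility(self):
        # The agent always evaluates whatever its pointer currently targets.
        return self.world_model[self.utility_pointer]

    def learn(self, place, representation):
        # Processing more input improves the world model...
        self.world_model[place] = representation
        # ...but does not touch the pointer: judged by the current utility
        # (the thing at A), re-pointing somewhere else is not an improvement.


def sgd_update(agent, new_place):
    # Only an external training update (SGD) moves the pointer.
    agent.utility_pointer = new_place


agent = Agent()
agent.learn("B", "better representation of the base objective")
print(agent.utility())   # still the thing at place A
sgd_update(agent, "B")   # external modification by SGD
print(agent.utility())   # now the thing at place B
```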