I think grading in some form will be necessary in the sense that we don’t know what value heuristics will be sufficient to ensure alignment in the AI. We will most likely need to add corrections to its reward signals on the fly, even as it learns to extrapolate its own values from those heuristics. In other words, grading.
However, the crucial point seems to be that we need to avoid including grader evaluations in the AI’s self-evaluation model, for the same reason that we shouldn’t give it direct access to its reward button. In other words, don’t build the AI like this:
[planning module] → [predicted grader output] → [internal reward signal] → [reinforce policy function]
Instead, it should look more like this:
[planning module] → [predicted world state] → [internal reward signal] → [reinforce policy function]
The predicted grader output may be part of the AI’s predicted world state (if a grader is used), but it shouldn’t be the part that triggers reward. The trick, then, would be to identify the part of the AI’s world model that corresponds to what we want it to care about and feed only that part into the learned reward signal.
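To make the distinction concrete, here is a minimal Python sketch of the two wirings, under my own assumptions; all the names (PredictedWorldState, reward_from_grader, care_about, and so on) are illustrative placeholders, not a claim about how the real system would be built:

```python
# Hypothetical sketch of the two reward wirings described above.
# Every name here is an illustrative assumption, not a real architecture.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PredictedWorldState:
    """The planner's prediction of the world after executing a plan.

    The predicted grader score is just one feature among many; the
    features we actually want the AI to care about sit alongside it.
    """
    features: Dict[str, float] = field(default_factory=dict)
    predicted_grader_score: float = 0.0


def reward_from_grader(state: PredictedWorldState) -> float:
    # The wiring to avoid: reward is triggered directly by the predicted
    # grader output, so the policy gets reinforced toward whatever its
    # model of the grader will call "good".
    return state.predicted_grader_score


def reward_from_world_state(state: PredictedWorldState,
                            care_about: List[str]) -> float:
    # The proposed wiring: reward is computed only from the designated
    # slice of the predicted world state -- the part we want the AI to
    # care about. The predicted grader score may still live in the world
    # model, but it never feeds the reward signal.
    return sum(state.features.get(name, 0.0) for name in care_about)


if __name__ == "__main__":
    # Toy case: a plan the grader would score highly even though the
    # features we care about barely improve.
    plan_prediction = PredictedWorldState(
        features={"humans_helped": 0.1, "resources_preserved": 0.2},
        predicted_grader_score=0.95,  # e.g. the plan games the grader
    )

    print("reward via grader output:", reward_from_grader(plan_prediction))
    print("reward via world state:  ",
          reward_from_world_state(plan_prediction,
                                  care_about=["humans_helped",
                                              "resources_preserved"]))
```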