evhub comments on New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

evhub 16 Nov 2023 1:09 UTC

If anything, I’ve taken my part of the discussion from Twitter to LW.

Good point. I think I’m misdirecting my annoyance here; I really dislike that there’s so much alignment discussion moving from LW to Twitter, but I shouldn’t have implied that you were responsible for that—and in fact I appreciate that you took the time to move this discussion back here. Sorry about that—I edited my comment.

And my response is that I think the model pays a complexity penalty for runtime computations (since they translate into constraints on parameter values which are needed to implement those computations). Even if those computations are motivated by something we call a “goal”, they still need to be implemented in the circuitry of the model, and thus also constrain its parameters.

Yes, I think we agree there. But the fact that deceptive alignment is a way of calculating what the training process wants you to do doesn’t imply that you can then just memorize the result of that computation in the weights and thereby simplify the model, for the same reason that SGD doesn’t memorize the entire distribution in the weights either.