I really don’t like that you’ve taken this discussion to Twitter. I think Twitter is really a much worse forum for talking about complex issues like this than LW/AF.
I haven’t “taken this discussion to Twitter”. Joe Carlsmith posted about the paper on Twitter. I saw that post, and wrote my response on Twitter. I didn’t even know it was also posted on LW until later, and decided to repost the stuff I’d written on Twitter here. If anything, I’ve taken my part of the discussion from Twitter to LW. I’m slightly baffled and offended that you seem to be platform-policing me?
Anyways, it looks like you’re making the objection I predicted with the paragraphs:
One obvious counterpoint I expect is to claim that the “[have some internal goal x] [backchain from wanting x to the stuff needed to get x (doing well at training)]” steps actually do contribute to the later steps, maybe because they’re a short way to compress a motivational pointer to “wanting” to do well on the training objective.
I don’t think this is how NN simplicity biases work. Under the “cognitive executions impose constraints on parameter settings” perspective, you don’t actually save any complexity by supposing that the model has some motive for figuring stuff out internally, because the circuits required to implement the “figure stuff out internally” computations themselves count as additional complexity. In contrast, if you have a view of simplicity that’s closer to program description length, then you’re not counting runtime execution against program complexity, and so a program that has short length in code but long runtime can count as simple.
In particular, when I said “maybe because they’re a short way to compress a motivational pointer to “wanting” to do well on the training objective,” I think I was pointing at the same thing you reference when you say “The entire question is about what the easiest way is to produce that distribution in terms of the inductive biases.”
I.e., given the actual simplicity bias of models, what is the shortest (or most compressed) way of specifying “a model that starts by trying to do well in training”? And my response is that I think the model pays a complexity penalty for runtime computations (since they translate into constraints on parameter values which are needed to implement those computations). Even if those computations are motivated by something we call a “goal”, they still need to be implemented in the circuitry of the model, and thus also constrain its parameters.
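To make that contrast concrete, here’s a toy sketch (mine, with made-up parameter counts, purely to illustrate the two notions of simplicity rather than to model any real network): under a description-length view, the short looping program below is cheap no matter how long it runs, while under the “computations constrain parameters” view every step of that loop has to be paid for in weights.

```python
# Toy illustration of the two notions of "simplicity", using parity of an
# n-bit input as a stand-in computation. All numbers are made up.

def parity_by_program(bits):
    """Description-length view: the code is a few lines, and the loop's
    runtime work doesn't count against the program's complexity."""
    acc = 0
    for b in bits:
        acc ^= b
    return acc

def parity_circuit_parameter_cost(n, params_per_xor_gadget=9):
    """Circuit / parameter-constraint view: a feedforward net has to implement
    each XOR step in its weights, so the parameter cost grows with the amount
    of computation actually performed. The per-gadget figure is a toy
    assumption, not a measurement."""
    return n * params_per_xor_gadget

bits = [1, 0, 1, 1, 0, 1, 0, 0]
print(parity_by_program(bits))                    # short code, O(n) runtime
print(parity_circuit_parameter_cost(len(bits)))   # weight cost scales with the computation
```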
Also, when I reference models whose internal cognition looks like “[figure out how to do well at training] [actually do well at training]”, I don’t have sycophantic models in particular in mind. It also includes aligned models, since those models do implement the “[figure out how to do well at training] [actually do well at training]” steps (assuming that aligned behavior does well in training).
If anything, I’ve taken my part of the discussion from Twitter to LW.
Good point. I think I’m misdirecting my annoyance here; I really dislike that there’s so much alignment discussion moving from LW to Twitter, but I shouldn’t have implied that you were responsible for that—and in fact I appreciate that you took the time to move this discussion back here. Sorry about that—I edited my comment.
And my response is that I think the model pays a complexity penalty for runtime computations (since they translate into constraints on parameter values which are needed to implement those computations). Even if those computations are motivated by something we call a “goal”, they still need to be implemented in the circuitry of the model, and thus also constrain its parameters.
Yes, I think we agree there. But it doesn’t follow that, just because deceptive alignment is a way of calculating what the training process wants you to do, you can then just memorize the result of that computation in the weights and thereby simplify the model—for the same reason SGD doesn’t memorize the entire distribution in the weights either.
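As a toy way to see that asymmetry (made-up numbers, just to show the scaling): a memorized lookup table has to pay per case it covers, while a circuit that actually computes the answer pays a roughly fixed cost however many inputs it’s run on, so over a broad distribution the memorized version doesn’t come out simpler.

```python
# Toy comparison (made-up numbers): storage cost of memorizing computed answers
# vs. a fixed cost for implementing the computation, as the covered
# distribution grows.

def memorization_cost(num_cases, bits_per_case=16):
    # A lookup table scales with the number of distinct cases it stores.
    return num_cases * bits_per_case

def computation_cost(fixed_circuit_bits=4096):
    # A circuit that computes the answer pays a fixed cost, independent of how
    # many inputs it's evaluated on.
    return fixed_circuit_bits

for n in (10, 1_000, 1_000_000):
    print(n, memorization_cost(n), computation_cost())
```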