Sure, but don’t you agree that it’s a very confusing use of the term?
Maybe? Idk, according to me the goal of alignment is “create a model that is motivated to help us”, and so misalignment = not-alignment = “the model is not motivated to help us”. Feels pretty clear to me but illusion of transparency is a thing.
I am making a claim that for the purposes of alignment of capable systems, you do want to talk about “motivation”. So to the extent GPT-N / Codex-N doesn’t have a motivation, but is existentially risky, I’m claiming that you want to give it a motivation. I wouldn’t say this with high confidence but it is my best guess for now.
(Also, the fact that Gwern, who ascribes agency to GPT-3, quoted specifically this part in his comment is further evidence that your phrasing implies agency to different readers.)
I think Gwern is using “agent” in a different way than you are ¯\_(ツ)_/¯
I don’t think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He’d probably be more specific than me just because he’s worked with it a lot more than I have.)
Maybe the real reason it feels weird for me to call this behavior of Codex misalignment is that it is so obvious?
It doesn’t seem like whether something is obvious or not should determine whether it is misaligned—it’s obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned.
Almost all specification gaming examples are subtle, tricky, or involve exploiting bugs.
I think that’s primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.
Sorry for the delay in answering, I was a bit busy.
I am making a claim that for the purposes of alignment of capable systems, you do want to talk about “motivation”. So to the extent GPT-N / Codex-N doesn’t have a motivation, but is existentially risky, I’m claiming that you want to give it a motivation. I wouldn’t say this with high confidence but it is my best guess for now
That makes some sense, but I do find the “motivationless” state interesting from an alignment point of view: if it has no motivation, it also doesn’t have a motivation to do all the things we don’t want. We thus get some corrigibility by default, because we can change its motivation just by changing the prompt.
I think Gwern is using “agent” in a different way than you are ¯\_(ツ)_/¯
I don’t think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He’d probably be more specific than me just because he’s worked with it a lot more than I have.)
Agreed that there’s not much difference when predicting GPT-3. But that’s because we’re at the point in scaling where Gwern (AFAIK) describes the LM as an agent that is very good at predicting. By definition it will not do anything different from a simulator, since its “goal” literally encodes all of its behavior.
Yet there is a difference when scaling. If Gwern is right (or if LMs become more like what he’s describing as they get bigger), then we end up with a single agent that we probably shouldn’t trust, because of all our usual worries about alignment. On the other hand, if scaled-up LMs are non-agentic/simulator-like, then they would stay motivationless, and there would at least be the possibility of using them to help alignment research, for example by having them simulate non-agenty systems.
It doesn’t seem like whether something is obvious or not should determine whether it is misaligned—it’s obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned.
Fair enough.
I think that’s primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.
Yeah, you’re probably right.
Yet there is a difference when scaling. If Gwern is right (or if LMs become more like what he’s describing as they get bigger), then we end up with a single agent that we probably shouldn’t trust, because of all our usual worries about alignment. On the other hand, if scaled-up LMs are non-agentic/simulator-like, then they would stay motivationless, and there would at least be the possibility of using them to help alignment research, for example by having them simulate non-agenty systems.
Yeah, I agree that in the future there is a difference. I don’t think we know which of these situations we’re going to be in (which is maybe what you’re arguing). Idk what Gwern predicts.
Exactly. I’m mostly arguing that the case for the agent situation isn’t as clear-cut as I’ve seen some people claim, which doesn’t mean it’s not possibly true.