Sorry for ascribing to you beliefs you don’t have. I guess I’m just used to people here and elsewhere assuming goals and agency in language models, and some of your word choices sounded very goal-directed/intentional-stance to me.
Maybe you’re objecting to the “motivated” part of that sentence? But I was saying that it isn’t motivated to help us, not that it is motivated to do something else.
Sure, but don’t you agree that it’s a very confusing use of the term? Like, if I say GPT-3 isn’t trying to kill me, I’m not saying it is trying to kill anyone, but I’m sort of implying that intention is the right framing to talk about it. In this case, the “motivated” part did trigger me, because it implied that the right framing is to think about what Codex wants, which I don’t think is right (and apparently you agree).
(Also, the fact that Gwern, who does ascribe agency to GPT-3, quoted specifically this part in his comment is further evidence that your wording implies agency to different people.)
Maybe you’re objecting to words like “know” and “capable”? But those don’t seem to imply agency/goals; it seems reasonable to say that Google Maps knows about traffic patterns and is capable of predicting route times.
Agreed with you there.
As an aside, this was Codex rather than GPT-3, though I’d say the same thing for both.
True, but I don’t feel there is a large enough difference between Codex and GPT-3 in size or training to warrant different conclusions about ascribing goals/agency.
I don’t care what it is trained for; I care whether it solves my problem. Are you telling me that you wouldn’t count any of the reward misspecification examples as misalignment? After all, those agents were trained to optimize the reward, not to analyze what you meant and fix your reward.
First, I think I interpreted “misalignment” here to mean “inner misalignment”, hence my answer. I also agree that all examples in Victoria’s doc are showing misalignment. That being said, I still think there is a difference with the specification gaming stuff.
Maybe the real reason it feels weird for me to call this behavior of Codex misalignment is that it is so obvious? Almost all specification gaming examples are subtle, or tricky, or exploit bugs. They’re things that I would expect a human to fail to find, even given the precise loss and training environment. Whereas I expect any human to complete buggy code with buggy code once you explain to them that Codex predicts the most probable next token given all the preceding code.
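To make the “completes buggy code with buggy code” point concrete, here’s a minimal sketch. The probability table and the `ratio` example are invented for illustration (a real model’s probabilities come from its training corpus, and it works token by token rather than line by line); the point is only the decision rule: a system that always emits the highest-probability continuation reproduces whatever idiom dominates its training data, bugs included.

```python
# A stand-in "language model": an explicit table of next-line
# probabilities conditioned on the previous line. The numbers are
# invented; imagine they reflect a corpus where most division code
# skips the zero check.
NEXT_LINE_PROBS = {
    "def ratio(a, b):": {
        "    return a / b  # no zero check (dominant idiom in corpus)": 0.7,
        "    return a / b if b else 0.0": 0.3,
    },
}

def most_probable_continuation(line: str) -> str:
    """Greedy decoding: emit the single most probable next line."""
    options = NEXT_LINE_PROBS[line]
    return max(options, key=options.get)

# The "model" dutifully continues with the buggy-but-common idiom.
print(most_probable_continuation("def ratio(a, b):"))
```

Nothing here is subtle or adversarial, which is exactly the contrast with typical specification gaming examples: once you know the decoding rule, the buggy completion is the expected behavior.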
But there doesn’t seem to be a real disagreement between us: I agree that GPT-3/Codex seem fundamentally unable to get really good at the “Chatbot task” I described above, which is what I gather you mean by “solving my problem”.
(By the way, I have an old post about formulating the task we want GPT-3 to solve. It was written before I actually studied GPT-3, but I think it holds up decently well. I also ran some experiments on GPT-3 with EleutherAI people on whether bigger models get better at answering more variations of the prompt for the same task.)
Sure, but don’t you agree that it’s a very confusing use of the term?
Maybe? Idk, according to me the goal of alignment is “create a model that is motivated to help us”, and so misalignment = not-alignment = “the model is not motivated to help us”. Feels pretty clear to me, but illusion of transparency is a thing.
I am making a claim that for the purposes of alignment of capable systems, you do want to talk about “motivation”. So to the extent GPT-N / Codex-N doesn’t have a motivation, but is existentially risky, I’m claiming that you want to give it a motivation. I wouldn’t say this with high confidence but it is my best guess for now.
(Also the fact that gwern, which ascribe agency to GPT-3, quoted specifically this part in his comment is another evidence that you’re implying agency for different people)
I think Gwern is using “agent” in a different way than you are ¯\_(ツ)_/¯
I don’t think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He’d probably be more specific than me just because he’s worked with it a lot more than I have.)
Maybe the real reason it feels weird for me to call this behavior of Codex misalignment is that it is so obvious?
It doesn’t seem like whether something is obvious or not should determine whether it is misaligned—it’s obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned.
Almost all specification gaming examples are subtle, or tricky, or exploiting bugs.
I think that’s primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.
Sorry for the delay in answering, I was a bit busy.
I am making a claim that for the purposes of alignment of capable systems, you do want to talk about “motivation”. So to the extent GPT-N / Codex-N doesn’t have a motivation, but is existentially risky, I’m claiming that you want to give it a motivation. I wouldn’t say this with high confidence but it is my best guess for now
That makes some sense, but I do find the “motivationless” state interesting from an alignment point of view. Because if it has no motivation, it also doesn’t have a motivation to do all the things we don’t want. We thus get some corrigibility by default, because we can change its motivation just by changing the prompt.
I think Gwern is using “agent” in a different way than you are ¯\_(ツ)_/¯
I don’t think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He’d probably be more specific than me just because he’s worked with it a lot more than I have.)
Agreed that there’s not much difference when predicting GPT-3. But that’s because we’re at the point in scaling where Gwern (AFAIK) describes the LM as an agent whose goal is to predict well. By definition it will not do anything different from a simulator, since its “goal” literally encodes all of its behavior.
Yet there is a difference when scaling. If Gwern is right (or if LMs become more like what he’s describing as they get bigger), then we end up with a single agent which we probably shouldn’t trust, because of all our many worries with alignment. On the other hand, if scaled-up LMs are non-agentic/simulator-like, then they would stay motivationless, and there would at least be the possibility of using them to help alignment research, for example by trying to simulate non-agenty systems.
It doesn’t seem like whether something is obvious or not should determine whether it is misaligned—it’s obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned.
Fair enough.
I think that’s primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.

Yeah, you’re probably right.
Yet there is a difference when scaling. If Gwern is right (or if LMs become more like what he’s describing as they get bigger), then we end up with a single agent which we probably shouldn’t trust, because of all our many worries with alignment. On the other hand, if scaled-up LMs are non-agentic/simulator-like, then they would stay motivationless, and there would at least be the possibility of using them to help alignment research, for example by trying to simulate non-agenty systems.
Yeah, I agree that in the future there is a difference. I don’t think we know which of these situations we’re going to be in (which is maybe what you’re arguing). Idk what Gwern predicts.
Exactly. I’m mostly arguing that I don’t think the case for the agent situation is as clear cut as I’ve seen some people defend it, which doesn’t mean it’s not possibly true.