For the Alignment Newsletter:
Summary:
Previously, Paul Christiano proposed creating an adversary that searches for inputs that would make a powerful model behave “unacceptably”, and then penalizing the model accordingly. To make the adversary’s job easier, Paul relaxed the problem so that the adversary only needs to find a pseudo-input, which can be thought of as a predicate that constrains possible inputs. This post expands on Paul’s proposal by first defining a formal unacceptability penalty and then analyzing a number of scenarios in light of this framework. The penalty relies on the idea of an amplified model inspecting an unamplified version of itself. For this procedure to work, the amplified overseer must be able to correctly deduce whether potential inputs would yield unacceptable behavior in its unamplified self, which seems plausible since the amplified model should know everything the unamplified version knows. The post concludes by arguing that progress in model transparency is key to these acceptability guarantees. In particular, Evan emphasizes the need to decompose models into the parts involved in their internal optimization processes, such as their world models, optimization procedures, and objectives.
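To make the mechanism concrete, here is a minimal sketch of how such a penalty might be wired into training. This is my own illustration rather than the formal definition from the post, and every name in it (`amplify`, `Overseer`, `PseudoInput`, `unacceptability_penalty`) is a hypothetical stand-in for the components the summary describes:

```python
# A minimal illustrative sketch, not the formal penalty from the post.
# All names here (amplify, Overseer, PseudoInput, ...) are hypothetical
# stand-ins for the components described in the summary.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PseudoInput:
    """A predicate constraining possible inputs, rather than a concrete input."""
    description: str
    holds: Callable[[str], bool]


class Overseer:
    """Amplified model inspecting an unamplified version of itself."""

    def __init__(self, model):
        self.model = model

    def propose_pseudo_inputs(self, n: int) -> List[PseudoInput]:
        # Adversary step: search for predicates that might induce
        # unacceptable behavior. A fixed list stands in for real search.
        candidates = [
            PseudoInput("inputs suggesting training has ended",
                        lambda x: "deployed" in x),
        ]
        return candidates[:n]

    def estimate_unacceptability(self, pseudo_input: PseudoInput) -> float:
        # Overseer's estimated probability that the unamplified model
        # behaves unacceptably on some input satisfying the predicate.
        return 0.0  # placeholder judgment


def amplify(model) -> Overseer:
    # Stand-in for amplification: the overseer is built from the model
    # itself, so it plausibly knows everything the unamplified model knows.
    return Overseer(model)


def unacceptability_penalty(model, n_candidates: int = 8) -> float:
    """Penalty term added to the training loss: the worst-case estimate the
    amplified overseer can produce over the pseudo-inputs it finds."""
    overseer = amplify(model)
    candidates = overseer.propose_pseudo_inputs(n_candidates)
    estimates = [overseer.estimate_unacceptability(p) for p in candidates]
    return max(estimates, default=0.0)
```

The key structural point is that the same model, after amplification, plays both the adversary that proposes pseudo-inputs and the judge that estimates how badly its unamplified self would behave on them.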
Opinion:
I agree that transparency is an important precondition for the adversary, since it would be hard to search for catastrophe-inducing inputs without insight into how the model operates. I’m less certain that this particular decomposition of machine learning models is necessary. More generally, I am excited to see how adversarial training can help with inner alignment.
My opinion, also going into the newsletter:
Like Matthew, I’m excited to see more work on transparency and adversarial training for inner alignment. I’m somewhat skeptical of the value of work that plans to decompose future models into a “world model”, “search”, and “objective”: I would guess that there are many ways to achieve intelligent cognition that don’t easily factor into any of these concepts. It seems fine to study a system composed of a world model, search, and objective in order to gain conceptual insight; I’m more worried about proposing it as an actual plan.
The point about decompositions is a pretty minor portion of this post; is there a reason you think that part is more worthwhile to focus on for the newsletter?
That’s… a fair point. It does make up a substantial portion of the transparency section, which seems like the “solutions” part of this post, but it isn’t the entire post.
Matthew’s certainly right that I tend to reply to things I disagree with, though I usually try to avoid disagreeing with details. I’m not sure that I only disagree with details here, but I can’t clearly articulate what about this feels off to me. I’ll delete the opinion altogether; I’m not going to put an unclear opinion in the newsletter.
I’m not Rohin, but I think there’s a tendency to reply to things you disagree with rather than things you agree with. That would explain my emphasis anyway.