My opinion, also going into the newsletter:
Like Matthew, I’m excited to see more work on transparency and adversarial training for inner alignment. I’m somewhat skeptical of the value of work that plans to decompose future models into a “world model”, “search” and “objective”: I would guess that there are many ways to achieve intelligent cognition that don’t easily factor into any of these concepts. It seems fine to study a system composed of a world model, search and objective in order to gain conceptual insight; I’m more worried about proposing it as an actual plan.
The point about decompositions is a pretty minor portion of this post; is there a reason you think that part is more worthwhile to focus on for the newsletter?
That’s… a fair point. It does make up a substantial portion of the transparency section, which seems like the “solutions” part of this post, but it isn’t the entire post.
Matthew’s certainly right that I tend to reply to things I disagree with, though I usually try to avoid disagreeing with details. I’m not sure that I only disagree with details here, but I can’t clearly articulate what feels off to me. I’ll delete the opinion altogether; I’m not going to put an unclear opinion in the newsletter.
I’m not Rohin, but I think there’s a tendency to reply to things you disagree with rather than things you agree with. That would explain my emphasis anyway.