Vanessa Kosoy comments on Formal Solution to the Inner Alignment Problem

Vanessa Kosoy 27 Feb 2021 23:55 UTC
1 point
AF
You have no guarantees, sure, but that’s a problem with deep learning in general and not just inner alignment. The point is, if your model is not optimizing that reward function then its performance during training will be suboptimal. To the extent your algorithm is able to approximate the true optimum during training, it will behave safely during deployment.
- evhub 28 Feb 2021 0:16 UTC
  LW: 2 AF: 2
  
  that’s a problem with deep learning in general and not just inner alignment
  
  I think you are understanding inner alignment very differently from how we define it in Risks from Learned Optimization, where we introduced the term.
  
  The point is, if your model is not optimizing that reward function then its performance during training will be suboptimal.
  
  This is not true for deceptively aligned models, which is the situation I’m most concerned about, and—as we argue extensively in Risks from Learned Optimization—there are a lot of reasons why a model might end up pursuing a simpler/faster/easier-to-find proxy even if that proxy yields suboptimal training performance.
  - michaelcohen 28 Feb 2021 11:30 UTC
    LW: 1 AF: 1
    It may be helpful to point to specific sections of such a long paper.
    (Also, I agree that a neural network trained with that reward could produce a deceptive model that makes a well-timed error.)