A few questions, if you have time:
1. Which situation do you think is easier to analyse: path-dependent or path-independent?
2. Which of your analyses do you think is more robust: path-dependent or independent? Is the difference large?
3. I think the path-dependent analysis should feature an assumption that, at one extreme, yields the path-independent analysis, but I can’t see it. What say you?
1. The path-independent case is probably overall easier to analyze: we understand things like speed and simplicity pretty well, and in the path-independent case we don’t have to keep track of complex sequencing effects.
2. I think my path-independent analysis is more robust, for the same reasons as in the first point.
3. Presumably that assumption is path-dependence? I’m not sure what you mean.
Thanks for taking the time to respond. To explain my third question, my take on your path-dependent analysis is that you have two basic assumptions:
1. each step of training updates the behaviour in the direction of “locally minimising loss over all training data”;
2. training will not move the model between states with equal loss over all training data.
Holding the training data fixed, assumption 1 means you get the same sequence of updates no matter which fragment of the data is used to work out the next training step (see the sketch below). So you can get very different behaviour for different training data, but not for different orderings of the same training data. To begin with, then, I’m not sure these assumptions actually yield path dependence.
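To make that concrete, here is a minimal toy sketch (my own illustration with made-up data, not anything from your analysis): under full-batch gradient descent the update at each step depends only on the mean gradient over all the data, so permuting the data changes nothing beyond floating-point summation order, whereas per-example SGD does depend on the ordering.

```python
import numpy as np

# Toy 1-D linear regression y = w * x with squared-error loss.
rng = np.random.default_rng(0)
xs = rng.normal(size=8)
ys = 3.0 * xs + rng.normal(scale=0.1, size=8)

def grad(w, x, y):
    # d/dw of 0.5 * (w * x - y)**2
    return (w * x - y) * x

def full_batch_gd(order, steps=50, lr=0.1):
    # Assumption 1: each update follows the gradient of the loss over
    # *all* training data, so the data ordering only affects the
    # floating-point summation order inside the mean.
    w = 0.0
    x, y = xs[order], ys[order]
    for _ in range(steps):
        w -= lr * np.mean(grad(w, x, y))
    return w

def per_example_sgd(order, steps=50, lr=0.1):
    # One example per step: the trajectory now depends on the ordering.
    w = 0.0
    x, y = xs[order], ys[order]
    for i in range(steps):
        w -= lr * grad(w, x[i % len(x)], y[i % len(y)])
    return w

order_a = np.arange(8)
order_b = rng.permutation(8)

print(full_batch_gd(order_a), full_batch_gd(order_b))      # agree (up to rounding)
print(per_example_sgd(order_a), per_example_sgd(order_b))  # trajectories differ
```

On this convex toy problem the two per-example runs still converge near w = 3, so the ordering only shows up in the trajectory; it takes multiple minima for the endpoint itself to become order-dependent.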
Secondly, assumption 2 might seem like an assumption that gets you path dependence: e.g. if there are lots of global minima, then you can just end up at one of them at random. However, replacing assumption 2 with something like “given a collection of states with minimal loss, the model always goes to some preferred state from this collection” doesn’t get you the path-independent analysis. Instead of “the model converges to some behaviour that optimally trades off loss and inductive bias”, you end up with “the model converges to some behaviour that minimises training-set loss”. That is, your analysis of “path independence” seems better described as “inductive-bias independence” (or at least “weak inductive bias”), and the appropriate conclusion of this analysis would therefore seem to be that the model doesn’t generalise at all (not that it is merely deceptive).
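To spell out the gap I mean, here is a rough formalisation (the notation is mine, not yours: L_train is training-set loss, C an inductive-bias penalty such as simplicity, and lambda a trade-off weight). The path-independent analysis concludes something like the first line; the modified assumption 2 only buys you the second, which leaves off-distribution behaviour unconstrained.

```latex
% Path-independent conclusion: loss optimally traded off against an
% inductive-bias term C (e.g. a simplicity prior), weighted by \lambda.
\[
  \theta^{\ast} \in \arg\min_{\theta}
    \bigl[ L_{\mathrm{train}}(\theta) + \lambda\, C(\theta) \bigr]
\]
% Modified assumption 2: only training-set loss is pinned down, with
% ties broken by an arbitrary preference rule independent of C, so
% nothing constrains behaviour off the training distribution.
\[
  \theta^{\ast} \in \arg\min_{\theta} L_{\mathrm{train}}(\theta)
\]
```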