Beth Barnes comments on A Longlist of Theories of Impact for Interpretability

Beth Barnes 13 Apr 2022 13:35 UTC
LW: 2 AF: 1
Seems like a simplicity prior over explanations of model behavior is not the same as a simplicity prior over models? E.g. simplicity of explanation of a particular computation is a bit more like a speed prior. I don’t understand exactly what’s meant by explanations here. For some kinds of attribution, you can definitely have a simple explanation for a complicated circuit and/or long-running computation, e.g. if, under a relevant input distribution, one input almost always determines the output of a complicated computation.
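
A minimal sketch of that last case, under assumptions that are not in the comment itself: the names `slow_noise_bit`, `complicated`, and `simple_explanation` are hypothetical, and iterated SHA-256 is only a stand-in for "a complicated, long-running computation". The output is the input bit `x`, flipped only when a rare event occurs, so the one-line explanation "the output is `x`" matches almost everywhere on this input distribution even though the computation itself is long.

```python
import hashlib


def slow_noise_bit(seed: int, rounds: int = 200) -> int:
    """Stand-in for a long-running computation (hypothetical).

    Iterates SHA-256 `rounds` times and returns 1 only if the first
    byte of the final digest is 0, which should be rare (about 1/256).
    """
    digest = seed.to_bytes(8, "big")
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return 1 if digest[0] == 0 else 0


def complicated(x: int, seed: int) -> int:
    """The 'complicated circuit': x, flipped when the rare bit fires."""
    return x ^ slow_noise_bit(seed)


def simple_explanation(x: int, seed: int) -> int:
    """The short explanation: the output is just x."""
    return x


def agreement_rate(n_samples: int = 300) -> float:
    """Fraction of sampled inputs where the short explanation is right."""
    hits = 0
    for seed in range(n_samples):
        x = seed % 2
        hits += complicated(x, seed) == simple_explanation(x, seed)
    return hits / n_samples
```

Raising `rounds` makes the computation slower without making the explanation any longer, which is the sense in which the two can come apart here.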
- evhub 13 Apr 2022 19:16 UTC
  LW: 2 AF: 2
  
  E.g. simplicity of explanation of a particular computation is a bit more like a speed prior.
  
  I don’t think that the size of an explanation/proof of correctness for a program should be very related to how long that program runs—e.g. it’s not harder to prove something about a program with larger loop bounds, since you don’t have to unroll the loop, you just have to demonstrate a loop invariant.
  - ryan_greenblatt 15 Apr 2022 16:48 UTC
    1 point
    
    should be very related
    
    Perhaps you meant shouldn’t?
    - evhub 15 Apr 2022 20:54 UTC
      3 points
      
      I don’t think
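
A minimal sketch of the loop-invariant point made earlier in the thread, assuming a toy function `sum_below` (a hypothetical name, not something from the discussion). The runtime `assert` statements only check the invariant rather than prove it, but they show the relevant property: the invariant is one fixed-size statement, and it stays the same size whether the loop bound `n` is 10 or 10 million.

```python
def sum_below(n: int) -> int:
    """Return 0 + 1 + ... + (n - 1), assuming n >= 0."""
    total = 0
    i = 0
    while i < n:
        # Loop invariant: total is the sum of all integers below i.
        # The statement does not mention n, so it does not grow with
        # the loop bound.
        assert total == i * (i - 1) // 2
        total += i
        i += 1
    # On exit i == n, so the invariant gives the closed form directly.
    assert total == n * (n - 1) // 2
    return total
```

For example, `sum_below(10)` and `sum_below(1000)` run for very different numbers of iterations, but the argument for correctness is the same three steps in both cases: the invariant holds at `i = 0`, each iteration preserves it, and at exit it yields the result.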