Use the length of the shortest interpretability explanation of the model's behaviours as a training loss for ELK. The idea is that models with shorter explanations are less likely to include human simulations, and that if they do, you can tell.
Maybe this is not the right place to ask this, but how does this not just give you a simplicity prior?
By explanation, I think we mean ‘reason why a thing happens’ in some intuitive (and underspecified) sense. Explanation length gets at something like “how can you cluster/compress a justification for the way the program responds to inputs” (where justification is doing a lot of work). So, while the program itself is a great way to compress how the program responds to inputs, it doesn’t justify why it responds to inputs that way, so a program-length/simplicity prior isn’t equivalent. Here are some examples demonstrating where (I think) these priors differ:
The axioms of arithmetic don’t explain why the primes occur with the frequency they do: there is a short justification for this, but it’s longer than just the axioms and has to include them.
The explanation of why code-golfed programs work is often longer than the programs themselves (at least in English).
The shortest explanation for ‘the SHA-512 hash of the first 2000 primes is x’ probably has to include a full (long) computation trace despite the fact that a program which computes/checks this can be short.
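To make that last example concrete, here’s a minimal sketch in Python (the naive prime generator and the comma-separated serialisation of the primes are arbitrary choices of mine): the program is only a few lines, but the shortest explanation of why it prints one particular hex digest rather than another plausibly has to walk through the computation.

```python
import hashlib

def first_n_primes(n):
    """Return the first n primes by naive trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p != 0 for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes

# Short program, easily checked by rerunning it, but there is no obvious
# justification for the specific digest that is much shorter than the
# computation trace itself.
digest = hashlib.sha512(",".join(map(str, first_n_primes(2000))).encode()).hexdigest()
print(digest)
```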
Here’s a short and bad explanation for why this is maybe useful for ELK.
The reason the good reporter works is that it accesses the model’s concept for X and directly outputs it. The reason other possible reporter heads work is that they access the model’s concept for X and then do something with it (where the ‘doing something’ might happen in the core model or in the head).
So, the explanation for why the other heads work still has to go through the concept for X, but it has some other stuff tacked on, and so must be longer than the explanation for the good reporter.
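Here is a toy sketch of that claim (the function names and the dict-of-latents interface are invented for illustration, not part of the ELK setup):

```python
# Toy sketch: two reporter heads that both 'go through' the model's concept
# for X. All names here are hypothetical.

def direct_reporter(latents: dict) -> bool:
    # Explanation of why this works: "it reads the model's concept for X
    # and outputs it."
    return latents["concept_X"]

def indirect_reporter(latents: dict) -> bool:
    # Explanation of why this works: "it reads the model's concept for X
    # and then does something with it" -- the direct reporter's explanation
    # plus extra steps, so it can only be longer.
    x = latents["concept_X"]
    return not (not x)  # stand-in for the extra 'doing something'
```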
I definitely think there are bad reporter heads that don’t ever have to access X. E.g. the human imitator only accesses X if X is required to model humans, which is certainly not the case for all X.
Seems like a simplicity prior over explanations of model behavior is not the same as a simplicity prior over models? E.g. simplicity of explanation of a particular computation is a bit more like a speed prior. I don’t understand exactly what’s meant by explanations here. For some kinds of attribution, you can definitely have a simple explanation for a complicated circuit and/or long-running computation: e.g. if, under a relevant input distribution, one input almost always determines the output of a complicated computation.
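A minimal sketch of that last point (an invented example): a slow, complicated computation whose behaviour on the relevant input distribution has a one-sentence explanation, because a single input almost always decides the output there.

```python
# Invented example: one input almost always determines the output under the
# relevant distribution, so a short explanation covers almost all behaviour.

def complicated(x: int, override: bool) -> int:
    if override:
        # If deployment inputs almost always set `override`, then
        # "it returns 0 because the override is set" explains nearly all
        # observed behaviour, despite the expensive branch below.
        return 0
    total = 0
    for i in range(1_000_000):  # long-running path, rarely taken
        total = (total * 31 + x + i) % ((1 << 61) - 1)
    return total
```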
I don’t think that the size of an explanation/proof of correctness for a program should be very related to how long that program runs—e.g. it’s not harder to prove something about a program with larger loop bounds, since you don’t have to unroll the loop, you just have to demonstrate a loop invariant.
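To illustrate the loop-invariant point, here’s a standard example (written in Python, with the proof sketched in comments); the size of the invariant, and hence of the proof, doesn’t depend on the loop bound.

```python
def sum_below(n: int) -> int:
    """Return 0 + 1 + ... + (n - 1) for n >= 0."""
    total = 0
    i = 0
    # Loop invariant: total == i * (i - 1) // 2.
    # It holds on entry (total == 0, i == 0), each iteration preserves it,
    # and on exit i == n, giving total == n * (n - 1) // 2. The argument is
    # the same whether n is 10 or 10**9 -- no unrolling required.
    while i < n:
        total += i
        i += 1
    return total
```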
Perhaps you meant shouldn’t?
Honestly, I don’t understand ELK well enough (yet!) to meaningfully comment. That one came from Tao Lin, who’s a better person to ask.