Use the length of the shortest interpretability explanation of the model's behaviours as a training loss for ELK. The idea is that models with shorter explanations are less likely to include human simulations, and that if they do, you can tell.
Maybe this is not the right place to ask this, but how does this not just give you a simplicity prior?
By explanation, I think we mean ‘reason why a thing happens’ in some intuitive (and underspecified) sense. Explanation length gets at something like “how can you cluster/compress a justification for the way the program responds to inputs” (where justification is doing a lot of work). So, while the program itself is a great way to compress how the program responds to inputs, it doesn’t justify why it responds to inputs that way, so a program-length/simplicity prior isn’t equivalent. Here are some examples demonstrating where (I think) these priors differ:
The axioms of arithmetic don’t explain why the primes occur with the frequency they do: there is a short justification for this, but it’s longer than just the axioms and has to include them.
The explanation of why code-golfed programs work is often longer than the programs themselves (at least in English).
The shortest explanation for ‘the SHA-512 hash of the first 2000 primes is x’ probably has to include a full (long) computation trace despite the fact that a program which computes/checks this can be short.
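To make that last example concrete, here’s a minimal sketch in Python (the naive prime generator and the comma-separated serialisation of the primes are arbitrary choices of mine): the program is only a few lines, but the shortest explanation of why it prints one particular hex digest rather than another plausibly has to walk through the computation.

```python
import hashlib

def first_n_primes(n):
    """Return the first n primes by naive trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p != 0 for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes

# Short program, easily checked by rerunning it, but there is no obvious
# justification for the specific digest that is much shorter than the
# computation trace itself.
digest = hashlib.sha512(",".join(map(str, first_n_primes(2000))).encode()).hexdigest()
print(digest)
```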
Here’s a short and bad explanation for why this is maybe useful for ELK.
The reason the good reporter works is that it accesses the model’s concept for X and directly outputs it. The reason other possible reporter heads work is that they access the model’s concept for X and then do something with it (where the ‘doing something’ might happen in the core model or in the head).
So, the explanation for why the other heads work still has to go through the concept for X, but it has some other stuff tacked on, and so must be longer than the explanation for the good reporter.
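Here is a toy sketch of that claim (the function names and the dict-of-latents interface are invented for illustration, not part of the ELK setup):

```python
# Toy sketch: two reporter heads that both 'go through' the model's concept
# for X. All names here are hypothetical.

def direct_reporter(latents: dict) -> bool:
    # Explanation of why this works: "it reads the model's concept for X
    # and outputs it."
    return latents["concept_X"]

def indirect_reporter(latents: dict) -> bool:
    # Explanation of why this works: "it reads the model's concept for X
    # and then does something with it" -- the direct reporter's explanation
    # plus extra steps, so it can only be longer.
    x = latents["concept_X"]
    return not (not x)  # stand-in for the extra 'doing something'
```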
I definitely think there are bad reporter heads that don’t ever have to access X. E.g. the human imitator only accesses X if X is required to model humans, which is certainly not the case for all X.
Seems like a simplicity prior over explanations of model behavior is not the same as a simplicity prior over models? E.g. simplicity of explanation of a particular computation is a bit more like a speed prior. I don’t understand exactly what’s meant by explanations here. For some kinds of attribution, you can definitely have a simple explanation for a complicated circuit and/or long-running computation: e.g. if, under a relevant input distribution, one input almost always determines the output of a complicated computation.
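A minimal sketch of that last point (an invented example): a slow, complicated computation whose behaviour on the relevant input distribution has a one-sentence explanation, because a single input almost always decides the output there.

```python
# Invented example: one input almost always determines the output under the
# relevant distribution, so a short explanation covers almost all behaviour.

def complicated(x: int, override: bool) -> int:
    if override:
        # If deployment inputs almost always set `override`, then
        # "it returns 0 because the override is set" explains nearly all
        # observed behaviour, despite the expensive branch below.
        return 0
    total = 0
    for i in range(1_000_000):  # long-running path, rarely taken
        total = (total * 31 + x + i) % ((1 << 61) - 1)
    return total
```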
I don’t think that the size of an explanation/proof of correctness for a program should be very related to how long that program runs—e.g. it’s not harder to prove something about a program with larger loop bounds, since you don’t have to unroll the loop, you just have to demonstrate a loop invariant.
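To illustrate the loop-invariant point, here’s a standard example (written in Python, with the proof sketched in comments); the size of the invariant, and hence of the proof, doesn’t depend on the loop bound.

```python
def sum_below(n: int) -> int:
    """Return 0 + 1 + ... + (n - 1) for n >= 0."""
    total = 0
    i = 0
    # Loop invariant: total == i * (i - 1) // 2.
    # It holds on entry (total == 0, i == 0), each iteration preserves it,
    # and on exit i == n, giving total == n * (n - 1) // 2. The argument is
    # the same whether n is 10 or 10**9 -- no unrolling required.
    while i < n:
        total += i
        i += 1
    return total
```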
Perhaps you meant shouldn’t?
Honestly, I don’t understand ELK well enough (yet!) to meaningfully comment. That one came from Tao Lin, who’s a better person to ask.