I think this is exactly wrong. I think that mainly because I personally went into biology research, twelve years ago, expecting systems to be fundamentally messy and uninterpretable, and it turned out that biological systems are far less messy than I expected.
We’ve also seen the same, in recent years, with neural nets. Early on, lots of people expected that the sort of interpretable structure found by Chris Olah & co wouldn’t exist. And yet, whenever we actually delve into these systems, it turns out that there’s a ton of ultimately-relatively-simple internal structure.
That said, there is a pattern here: the simple, interpretable structure of complex systems often does not match what the humans studying them hypothesized a priori.
I’m not sure exactly what you mean by “ton of ultimately-relatively-simple internal structure”.
I’ll suppose you mean “a high percentage of what models use parameters for is ultimately simple to humans” (where by “simple to humans” we mean something like short description length in the prior of human knowledge, e.g., natural language).
If so, this hasn’t been my experience doing interp work or from the interp work I’ve seen (though it’s hard to tell: perhaps there exists a short explanation that hasn’t been found?). Beyond this, I don’t think you can/should make a large update (in either direction) from Olah et al’s prior work. The work should down-weight the probability of complete uninterpretability or extreme easiness.
As such, I expect (and observe) that views about the tractability of humans understanding models come down largely to priors or evidence from other domains.
In the spirit of Evan’s original post, here’s a (half-baked) simple model:
Simplicity claims are claims about how many bits (in the human prior) it takes to explain[1] some amount of performance in the NN prior.
E.g., suppose we train a model which gets 2 nats of loss with 100 billion parameters, and we can explain this model getting 2.5 nats using a 300 KB human-understandable manual (we might run into issues with irreducible complexity such that making a useful manual is hard, but let’s put that aside for now); a rough version of these numbers is sketched below.
So, ‘simplicity’ of this sort is lower bounded by the relative parameter efficiency of neural networks in practice vs the human prior.
In practice, you do worse than this insofar as NNs express things which are anti-natural in the human prior (in terms of parameter efficiency).
We can also reason about how ‘compressible’ the explanation is in a naive prior (e.g., a formal framework for expressing explanations which doesn’t utilize cleverer reasoning technology than NNs themselves). I don’t quite mean ‘compressible’, though; presumably that ends up getting you insane stuff, as compression usually does.
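For concreteness, here is a rough back-of-the-envelope version of the numbers in the example above. The 16-bits-per-parameter figure is an assumption for illustration, not something specified in the example:

```python
# Rough sketch of the bits comparison in the example above.
# Only "100 billion parameters", "300 KB manual", and the nats figures
# come from the example; bits-per-parameter is an assumption.

nn_params = 100e9        # 100 billion parameters (from the example)
bits_per_param = 16      # assumed storage precision per parameter
manual_bytes = 300e3     # 300 KB human-understandable manual (from the example)

nn_bits = nn_params * bits_per_param   # ~1.6e12 bits to write down the network
manual_bits = manual_bytes * 8         # ~2.4e6 bits to write down the manual

# The manual only explains the model getting 2.5 nats rather than 2 nats,
# but it does so with several hundred thousand times fewer bits.
print(f"network: {nn_bits:.1e} bits, manual: {manual_bits:.1e} bits, "
      f"ratio: {nn_bits / manual_bits:,.0f}x")
```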
[1] By “explain”, I mean something like the idea of heuristic arguments from ARC.
That’s fair—perhaps “messy” is the wrong word there. Maybe “it’s always weirder than you think”?
(Edited the post to “weirder.”)
Sounds closer. Maybe “there’s always surprises”? Or “your pre-existing models/tools/frames are always missing something”? Or “there are organizing principles, but you’re not going to guess all of them ahead of time”?