Here’s a mistake some people might be making with mechanistic interpretability theories of impact (and some other things, e.g. how much neuroscience is useful for understanding AI or humans).
When there are multiple layers of abstraction that build up to a computation, understanding the low level doesn’t help much with understanding the high level.
Examples:
1. Understanding semiconductors and transistors doesn’t tell you much about programs running on the computer. The transistors can be reconfigured into a completely different computer, and you’ll still be able to run the same programs. To understand a program, you don’t need to be thinking about transistors or logic gates. Often you don’t even need to be thinking about the bit level representation of data.
2. The computation happening in single neurons in an artificial neural network doesn’t have much relation to the computation happening at a high level. What I mean is that you can switch out activation functions, randomly connect neurons to other neurons, randomly share weights, or replace small chunks of the network with some other differentiable parameterized function. And assuming the thing is still trainable, the overall system will still learn to execute a function that is on a high level pretty similar to whatever high-level function you started with.[1] (See the sketch just after these examples.)
3. Understanding how neurons work doesn’t tell you much about how the brain works. Neuroscientists understand a lot about how neurons work. There are models that make good predictions about the behavior of individual neurons or synapses. I bet that the high level algorithms that are running in the brain are most naturally understood without any details about neurons at all. Neurons probably aren’t even a useful abstraction for that purpose.
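A minimal sketch of the kind of swap described in example 2 (mine, not the author’s; it assumes PyTorch and an arbitrary toy regression task): two small MLPs with different activation functions and widths, trained on the same data, end up computing very similar high-level functions even though their individual neurons compute very different things.

```python
# Minimal sketch (assumes PyTorch; the sin-fitting task is an arbitrary choice):
# swap the low-level details (activation function, width) and the learned
# high-level function barely changes.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def make_mlp(width, act_cls):
    # The "low-level" details are swappable arguments.
    return nn.Sequential(
        nn.Linear(1, width), act_cls(),
        nn.Linear(width, width), act_cls(),
        nn.Linear(width, 1),
    )

def train(model, x, y, steps=2000):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        F.mse_loss(model(x), y).backward()
        opt.step()
    return model

# Same training signal for both models: y = sin(2x) on [-3, 3].
x_train = torch.linspace(-3, 3, 256).unsqueeze(1)
y_train = torch.sin(2 * x_train)

model_a = train(make_mlp(width=64, act_cls=nn.ReLU), x_train, y_train)
model_b = train(make_mlp(width=32, act_cls=nn.Tanh), x_train, y_train)

# The single-neuron computations differ completely (ReLU vs Tanh, different
# widths), but the high-level input-output behaviour is nearly the same.
x_test = torch.linspace(-3, 3, 1001).unsqueeze(1)
with torch.no_grad():
    gap_models = (model_a(x_test) - model_b(x_test)).abs().max().item()
    gap_target = (model_a(x_test) - torch.sin(2 * x_test)).abs().max().item()
print(f"max |f_a - f_b| = {gap_models:.3f}, max |f_a - sin(2x)| = {gap_target:.3f}")
```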
Probably directions in activation space are also usually a bad abstraction for understanding how humans work, kinda analogous to how bit-vectors of memory are a bad abstraction for understanding how a program works.
Of course John has said this better.
[1] You can mess with the inductive biases of the training process this way, which might change the function that gets learned, but (my impression is) usually not that much if you’re just messing with activation functions.
Also note that (e.g.) Dan H.[1] has advocated for some version of this take. See for instance the Open Problems in AI X-Risk (Pragmatic AI safety #5) section on criticisms of transparency:
When analyzing complex systems (such as deep networks), it is tempting to separate the system into events or components (“parts”), analyze those parts separately, and combine results (“divide and conquer”).
This approach often wrongly assumes (Leveson 2020):
Separation does not distort the system’s properties
Each part operates independently
Each part acts the same when examined singly as when acting in the whole
Parts are not subject to feedback loops and nonlinear interactions
Interactions between parts can be examined pairwise
Searching for mechanisms and reductionist analysis is too simplistic when dealing with complex systems (see our third post for more).
People hardly understand complex systems. Grad students in ML don’t even understand various aspects of their field, how to make a difference in it, what trends are emerging, or even what’s going on outside their small area. How will we understand an intelligence that moves more quickly and has more breadth? The reach of a human mind has limits. Perhaps a person could understand a small aspect of an agent’s actions (or components), but it’d be committing the composition fallacy to suggest a group of people that individually understand a part of an agent could understand the whole agent.
[1] I think Dan is the source of this take in the post I link rather than the other author Thomas W, but not super confident.
I think this applies to Garrett Baker’s hopes for the application of singular learning theory to decoding human values.
Yeah I think I agree. It also applies to most research about inductive biases of neural networks (and all of statistical learning theory). Not saying it won’t be useful, just that there’s a large mysterious gap between great learning theories and alignment solutions, and inside that gap is (probably, usually) something like the levels-of-abstraction mistake.
This definitely sounds like a mistake someone could make while thinking about singular learning theory or neuroscience, but I don’t think it sounds like a mistake that I’m making? It does in fact seem a lot easier to go from a theory of how model structure, geometry, & rewards map to goal generalization, to a theory of values, than it does to go from the mechanics of transistors to Tetris, or from the mechanics of neurons to a theory of values.
The former problem (structure-geometry-&-rewards-to-goals to value-theory) is not trivial, but seems like only one abstraction leap, while the other problems seem like very many abstraction leaps (7 to be exact, in the case of transistors -> Tetris).
The problem is not that abstraction is impossible, but that abstraction is hard, and you should expect to be done in 50 years if launching a research program requiring crossing 7 layers of abstraction (roughly the time between Turing’s thesis and the making of Tetris). If just crossing one layer, the same crude math says you should expect to be done in 7 years. (Edit: Though also, going directly from a high-level abstraction to an adjacent high-level abstraction is a far easier search task than trying to connect a very low-level abstraction to a very high-level abstraction. This is also part of the mistake many make, and I claim it’s not a mistake that I’m making.)
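As a back-of-the-envelope check of the crude math above (the endpoint dates are my assumptions: roughly 1936 for Turing’s thesis and 1984 for Tetris):

```python
# Crude math from the comment above, with assumed endpoint dates
# (roughly 1936 for Turing's thesis, 1984 for Tetris).
years_for_seven_layers = 1984 - 1936   # ~48, i.e. roughly 50 years
layers = 7                             # transistors -> Tetris
years_per_layer = years_for_seven_layers / layers
print(years_per_layer)                 # ~7 years per abstraction leap
```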
My guess is that it’s pretty likely that all of:
There aren’t really any non-extremely-leaky abstractions in big NNs on top of something like a “directions and simple functions on these directions” layer. (I originally heard this take from Buck)
It’s very hard to piece together understanding of NNs from these low-level components[1]
It’s even worse if your understanding of low-level components is poor (only a small fraction of the training compute is explained)
That said, I also think it’s plausible that understanding the low level could help a lot with understanding the high level, even if there is a bunch of other needed work.
This will depend on the way in which you understand or operate on low-level components, of course. Like, if you could predict behavior perfectly just from a short text description for all low-level components, then you’d be fine. But this is obviously impossible in the same way it’s impossible for transistors. You’ll have to make reference to other concepts etc., and then you’ll probably have a hard time.
Of course this depends on what it’s trained to do? And it’s false for humans and animals and corporations and markets; we have pretty good abstractions that allow us to predict and sometimes modify the behavior of these entities.
I’d be pretty shocked if this statement was true for AGI.
I think this is going to depend on exactly what you mean by non-extremely-leaky abstractions.
For the notion I was thinking of, I think humans, animals, corporations, and markets don’t seem to have this.
I’m thinking of something like “some decomposition or guide which lets you accurately predict all behavior”. And then the question is how good the best abstractions in such a decomposition are.
There are obviously less complete abstractions.
(Tbc, there are abstractions on top of “atoms” in humans and abstractions on top of chemicals. But I’m not sure if there are very good abstractions on top of neurons which let you really understand everything that is going on.)
Ah I see, I was referring to less complete abstractions. The “accurately predict all behavior” definition is fine, but this comes with a scale of how accurate the prediction is. “Directions and simple functions on these directions” probably misses some tiny details like floating point errors, and if you wanted a human to understand it you’d have to use approximations that lose way more accuracy. I’m happy to lose accuracy in exchange for better predictions about behavior in previously-unobserved situations. In particular, it’s important to be able to work out what sort of previously-unobserved situation might lead to danger. We can do this with humans and animals etc.; we can’t do it with “directions and simple functions on these directions”.