I think that this general point about not understanding LLMs is being pretty systematically overstated here and elsewhere in a few different ways.
(Nothing against the OP in particular, which is explicitly trying to lean on this politically, as in “let’s use this claim politically.” But leaning on things politically is probably not the best way to keep terms clearly used? Even terms clearer than “understand” are apt to break down under political pressure, and “understand” is already pretty floaty and a suitcase word.)
What do I mean?
Well, two points.
If we don’t understand the forward pass of an LLM, then according to this use of “understanding” there are lots of other things we don’t understand that we nevertheless are deeply comfortable with.
Sure, we have an understanding of the dynamics of training loops and SGD’s properties, and we know how ML models’ architectures work. But we don’t know what specific algorithms ML models’ forward passes implement.
There are a lot of ways you can understand “understanding” the specific algorithm that ML models implement in their forward pass. You could say that understanding here means something like “You can turn the implemented algorithm from a very densely connected causal graph with many nodes into an abstract, sparsely connected causal graph with a handful of nodes with human-readable labels, which lets you reason about what happens without knowing the densely connected graph.”
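To make that distinction concrete, here’s a toy sketch in PyTorch (the model and dimensions are made up, not any particular LLM): we can write down and execute every node of the dense causal graph, i.e. the forward pass itself, and still have no sparse, human-labeled graph telling us what algorithm all those multiply-adds implement.

```python
import torch
import torch.nn as nn

# The "dense causal graph": every operation in this forward pass is fully
# specified, and we can inspect any intermediate value we like.
class TinyMLP(nn.Module):
    def __init__(self, d_in=16, d_hidden=64, d_out=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        return self.net(x)

model = TinyMLP()
x = torch.randn(1, 16)
logits = model(x)  # "understood" in the sense that we know every multiply-add

# What we generally *don't* have is the sparse graph: a handful of labeled
# nodes ("detects X, combines it with Y, outputs Z") that would let us predict
# the behavior without running the dense computation above.
```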
But like, we don’t understand lots of things in this way! And these things can nevertheless be engineered or predicted well, and they are not frightening at all. In this sense we also don’t understand:
Weather
The dynamics going on inside rocket exhaust, or a turbofan, or anything we model with CFD software
Every other single human’s brain on this planet
Probably our immune system
Or basically anything with chaotic dynamics. So sure, you can say we don’t understand the forward pass of an LLM, so we don’t understand them. But like—so what? Not everything in the world can be decomposed into a sparse causal graph, and we still say we understand such things. We basically understand weather. I’m still comfortable flying on a plane.
Inability to intervene effectively at every point in a causal process doesn’t mean that it’s unpredictable or hard to control from other nodes.
Or, at the very least, that it’s written in legible, human-readable and human-understandable format, and that we can interfere on it in order to cause precise, predictable changes.
Analogously: you cannot alter rocket exhaust in predictable ways once it has been ignited. But you can alter the rocket to make the exhaust do what you want.
Similarly, you cannot alter an already-made LLM in predictable ways without training it. But you can alter an LLM that you are training in… really pretty predictable ways.
Like, here are some predictions:
(1) The LLMs that are good at chess have a bunch of chess in their training data, with absolutely 0.0 exceptions
(2) The first LLMs that are good agents will have a bunch of agentlike training data fed into them, and will be best at the areas for which they have the most high-quality data
(3) If you can get enough data to make an agenty LLM, you’ll be able to make an LLM that does pretty shittily on the MMLU relative to GPT-4 etc., but which is a very effective agent, by making “useful for an agent” rather than “useful textbook knowledge” the criterion for inclusion in the training data. (MMLU is not an effective policy intervention target!)
(4) Training is such an effective way of putting behavior into LLMs that even when interpretability is like, 20x better than it is now, people will still usually be using SGD or AdamW or whatever to give LLMs new behavior, even when weight-level interventions are possible.
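To gesture at what that sort of control looks like in practice, here’s a minimal, hedged sketch of (3) and (4) together (the is_useful_for_agent filter and the tiny corpus are made-up placeholders, and gpt2 is just a stand-in model): the intervention happens at the data-selection and training nodes, with ordinary AdamW doing the work, and nothing clever done to the finished weights.

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy stand-in for the prediction-(3) criterion: keep data that is "useful for
# an agent" rather than "useful textbook knowledge". The filter is the real
# policy lever; everything after it is completely ordinary training.
def is_useful_for_agent(example: str) -> bool:
    return "plan:" in example.lower()

corpus = [
    "Plan: check the calendar, then draft the email, then send it.",
    "The mitochondria is the powerhouse of the cell.",  # textbook-ish, gets dropped
    "Plan: search the docs, find the right endpoint, write the call.",
]
curated = [ex for ex in corpus if is_useful_for_agent(ex)]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = AdamW(model.parameters(), lr=5e-5)

# Prediction (4) in miniature: new behavior goes in via AdamW on curated data,
# not via weight-level surgery on the finished artifact.
model.train()
for text in curated:
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```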
So anyhow, the point is that the inability to intervene on or alter a process at every point along its creation doesn’t mean that we cannot control it effectively at other points. We can control LLMs at other points.
(I think AI safety actually has a huge blind spot here. Like, I think the preponderance of the evidence is that the effective way to control not merely LLMs but all AI is to understand much more precisely how they generalize from training data, rather than trying to intervene in the created artifact. But there are like 10x more safety people looking into interpretability instead of how they generalize from data, as far as I can tell.)
But there are like 10x more safety people looking into interpretability instead of how they generalize from data, as far as I can tell.
I think interpretability is a really powerful lens for looking at how models generalize from data, partly just in terms of giving you a lot more stuff to look at than you would have purely by looking at model outputs.
If I want to understand the characteristics of how a car performs, I should of course spend some time driving the car around, measuring lots of things like acceleration curves and turning radius and power output and fuel consumption. But I should also pop open the hood, and try to figure out how the components interact, and how each component behaves in isolation in various situations, and, if possible, what that component’s environment looks like in various real-world conditions. (Also I should probably learn something about what roads are like, which I think would be analogous to “actually look at a representative sample of the training data”).
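In code, the “pop open the hood” move is roughly the difference between only sampling outputs and also keeping the intermediate activations around to study; here’s a minimal sketch with Hugging Face transformers (gpt2 is just a stand-in model, and deciding what to actually do with the activations is of course the hard part):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The car turned left because", return_tensors="pt")

# "Driving the car around": look only at behavior, i.e. sampled outputs.
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(generated[0]))

# "Popping the hood": the same forward pass, but keeping every layer's
# hidden states around as objects of study in their own right.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
hidden_states = outputs.hidden_states  # tuple of (num_layers + 1) tensors, each (1, seq_len, d_model)
print(len(hidden_states), hidden_states[0].shape)
```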
If we don’t understand the forward pass of an LLM, then according to this use of “understanding” there are lots of other things we don’t understand that we nevertheless are deeply comfortable with.
Solid points; I think my response to Steven broadly covers them too, though. In essence, the reasons we’re comfortable with some phenomenon/technology usually aren’t based on just one factor. And I think in the case of AIs, the assumption they’re legible and totally comprehended is one of the load-bearing reasons a lot of people are comfortable with them to begin with. “Just software”.
So explaining how very unlike normal software they are – that they’re as uncontrollable and chaotic as weather, as moody and incomprehensible as human brains – would… not actually sound irrelevant, let alone reassuring, to them.
I think this is more of a disagreement on messaging than a disagreement on facts.
I don’t see anyone disputing the “the AI is about as unpredictable as weather” claim, but it’s quite a stretch to summarize that as “we have no idea how the AI works.”
I understand that abbreviated and exaggerated messaging can be optimal for public messaging, but I don’t think there’s enough clarification in this post between direct in-group claims and examples of public messaging.
I would break this into three parts, to avoid misunderstandings from poorly contextualized language:
1. What is our level of understanding of AIs?
2. What is the general public’s expectation of our level of understanding?
3. What’s the best messaging to resolve this probable overestimation?
I don’t fully understand your implication that unpredictable things should not be frightening. In general, there is a difference between understanding and creating.
The weather is unpredictable but we did not create it; where we did and do create it, we indeed seem to be too careless. For human brains, we at least know that preferences are mostly not too crazy, and if they are, capabilities are not superhuman. With respect to the immune system, our understanding may not be very deep, but intervention is mostly limited by understanding, and where that is not true, we may be in trouble.
But there are like 10x more safety people looking into interpretability instead of how they generalize from data, as far as I can tell.)
An intriguing observation. But the ability to extrapolate accurately outside the training data is a result of building accurate world models. So to understand this, we’d need to understand the sorts of world models that LLMs build and how they interact. I’m having some difficulty immediately thinking of a way of studying that that doesn’t require first being a lot better at interpretability than we are now. But if you can think of one, I’d love to hear it.
I’m having some difficulty immediately thinking of a way of studying that
Pretty sure that’s not what 1a3orn would say, but you can study efficient world-models directly to grok that. Instead of learning about them through the intermediary of extant AIs, you can directly study the thing itself that these AIs are trying to ever-better approximate.
See my (somewhat outdated) post on the matter, plus the natural-abstractions agenda.