In defense of probably wrong mechanistic models
This is a short post on a simple point that I get asked about a lot and want a canonical reference for.
Which of the following two options is more likely to be true?
(1) AIs will internally be running explicit search processes.
(2) AIs will internally be doing something weirder and more complicated than explicit search.
In my opinion, whenever you’re faced with a question like this, it’s always weirder than you think, and you should pick option (2), or its equivalent, every single time. The problem, though, is that while option (2) is substantially more likely to be correct, it’s not at all predictive. It’s effectively just the “not (1)” hypothesis, which gets a lot of probability mass because it covers a lot of the space, but precisely because it covers so much of the space, it is extremely difficult to operationalize into any concrete predictions about what your AI will actually do.
The aphorism here is “All models are wrong, but some are useful.” Not having a model at all and just betting on the “something else” hypothesis is always going to be more likely than any specific model, but having specific models is nevertheless highly useful in a way that the “something else” hypothesis just isn’t.
Thus, I strongly believe that we should try our best to make lots of specific statements about internal structures even when we know those statements are likely to be wrong, because when we let ourselves make specific, structural, mechanistic models, we can get real, concrete predictions. And even if the model is literally false, to the extent that it has some plausible relationship to reality, the predictions that it makes can still be quite accurate.
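As a toy illustration of the tradeoff between being likely and being predictive, here is a minimal sketch. Everything in it is an assumption made up for illustration: the eight possible behaviors, the two probability distributions, and the priors are hypothetical numbers, not estimates about real AI systems. It only shows that a catch-all hypothesis can carry most of the prior while saying nothing about which outcome to expect.

```python
import math

def entropy_bits(dist):
    """Shannon entropy in bits; lower means sharper (more predictive)."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Hypothetical setup: 8 possible behaviors the AI might exhibit.
N_OUTCOMES = 8

# A specific mechanistic model (e.g. explicit search): probably wrong about
# the internals, but it concentrates its prediction on one behavior.
specific_model = [0.65] + [0.05] * (N_OUTCOMES - 1)

# The "something else" hypothesis: it does not say which behavior occurs,
# so the best it can do is spread its prediction evenly.
something_else = [1 / N_OUTCOMES] * N_OUTCOMES

# Assumed priors: the catch-all is much more likely to be "true".
prior = {"specific_model": 0.2, "something_else": 0.8}

specific_entropy = entropy_bits(specific_model)
catch_all_entropy = entropy_bits(something_else)

print(f"prior on specific model: {prior['specific_model']}")
print(f"prior on something else: {prior['something_else']}")
print(f"specific model entropy:  {specific_entropy:.2f} bits")
print(f"something else entropy:  {catch_all_entropy:.2f} bits")
```

The catch-all comes out at the maximum entropy for eight outcomes (3 bits), while the specific model is sharper. That sharpness is what makes the specific model testable and useful, even at a lower prior.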
Furthermore, one of my favorite strategies here is to come up with many different, independent mechanistic models and then see if they all converge: if you get the same prediction from lots of different mechanistic models, that adds a lot of credence to that prediction being quite robust. An example of this in the setting of modeling inductive biases is my “How likely is deceptive alignment?” post, where I take the two relatively independent—but both probably wrong—stories of high and low path-dependence and get the result that they both seem to imply a similar prediction about deceptive alignment, which I think lends a lot of credence to that prediction even if the specific models of inductive biases presented are unlikely to be literally correct.
Going back to the original question about explicit search, this is essentially how I like to think about the arguments in “Risks from Learned Optimization”: we argue that explicit search is a plausible model and explore what its predictions are. Though I think that the response “literally explicit search is unlikely” is potentially correct (depending on exactly how broad or narrow your understanding of explicit search is), it’s not very constructive. My response is usually, “okay, so what’s a better mechanistic model then?” That’s not to say that I don’t think there are any better mechanistic models than explicit search for what a powerful AI might be doing, but it is to say that coming up with some alternative mechanistic model is a necessary step in trying to improve on existing mechanistic models.
I think this is exactly wrong. I think that mainly because I personally went into biology research, twelve years ago, expecting systems to be fundamentally messy and uninterpretable, and it turned out that biological systems are far less messy than I expected.
We’ve also seen the same, in recent years, with neural nets. Early on, lots of people expected that the sort of interpretable structure found by Chris Olah & co wouldn’t exist. And yet, whenever we actually delve into these systems, it turns out that there’s a ton of ultimately-relatively-simple internal structure.
That said, it is a pattern that the simple interpretable structure of complex systems often does not match what humans studying them hypothesized a priori.
I’m not sure exactly what you mean by “ton of ultimately-relatively-simple internal structure”.
I’ll suppose you mean “a high percentage of what models use parameters for is ultimately simple to humans” (where by simple to humans we mean something like, description length in the prior of human knowledge, e.g., natural language).
If so, this hasn’t been my experience doing interp work or from the interp work I’ve seen (though it’s hard to tell: perhaps there exists a short explanation that hasn’t been found?). Beyond this, I don’t think you can/should make a large update (in either direction) from Olah et al.’s prior work. The work should down-weight the probability of complete uninterpretability or extreme easiness.
As such, I expect (and observe) that views about the tractability of humans understanding models come down largely to priors or evidence from other domains.
In the spirit of Evan’s original post, here’s a (half-baked) simple model:
Simplicity claims are claims about how many bits (in the human prior) it takes to explain[1] some amount of performance in the NN prior.
E.g., suppose we train a model which gets 2 nats of loss with 100 billion parameters and we can explain this model getting 2.5 nats using a 300 KB human-understandable manual (we might run into issues with irreducible complexity such that making a useful manual is hard, but let’s put that aside for now).
So, ‘simplicity’ of this sort is lower bounded by the relative parameter efficiency of neural networks in practice vs the human prior.
In practice, you do worse than this insofar as NNs express things which are anti-natural in the human prior (in terms of parameter efficiency).
We can also reason about how ‘compressible’ the explanation is in a naive prior (e.g., a formal framework for expressing explanations which doesn’t utilize cleverer reasoning technology than NNs themselves). I don’t quite mean compressible—presumably this ends up getting you insane stuff as compression usually does.
[1] By “explain,” I mean something like the idea of heuristic arguments from ARC.
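To make the 100-billion-parameter example concrete, here is a rough back-of-the-envelope calculation. Two figures are assumptions not given above: 16 bits per parameter, and 1 KB = 1000 bytes. The sketch only compares raw description sizes and the loss gap; it is not a measure of any real model.

```python
import math

# Figures taken from the example above.
PARAMS = 100e9            # 100 billion parameters
MODEL_LOSS_NATS = 2.0     # loss of the trained model
MANUAL_LOSS_NATS = 2.5    # loss explained by the manual
MANUAL_KB = 300           # size of the human-understandable manual

# Assumptions for illustration only.
BITS_PER_PARAM = 16       # assumed 16-bit weights
BYTES_PER_KB = 1000       # assumed decimal kilobytes

model_bits = PARAMS * BITS_PER_PARAM
manual_bits = MANUAL_KB * BYTES_PER_KB * 8

# How many times smaller the manual is than the raw parameters.
compression_ratio = model_bits / manual_bits

# Performance left unexplained by the manual, in nats and in bits.
loss_gap_nats = MANUAL_LOSS_NATS - MODEL_LOSS_NATS
loss_gap_bits = loss_gap_nats / math.log(2)

print(f"model description:  {model_bits:.3g} bits")
print(f"manual description: {manual_bits:.3g} bits")
print(f"compression ratio:  {compression_ratio:,.0f}x")
print(f"unexplained loss:   {loss_gap_nats} nats = {loss_gap_bits:.3f} bits")
```

Under these assumptions the manual is several hundred thousand times smaller than the raw weights while giving up 0.5 nats of loss. A simplicity claim in this sense is a claim about that kind of exchange rate.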
That’s fair—perhaps “messy” is the wrong word there. Maybe “it’s always weirder than you think”?
(Edited the post to “weirder.”)
Sounds closer. Maybe “there’s always surprises”? Or “your pre-existing models/tools/frames are always missing something”? Or “there are organizing principles, but you’re not going to guess all of them ahead of time”?
I’m currently writing up a post about the ROME intervention and its limitations. One point I want to illustrate is that the intervention is a bit more finicky than one might initially think. However, my hope is that such interventions, while not perfect at explaining something, will give us extra confidence in our interpretability results (in this case causal tracing).
If we do these types of interventions, I think we need to be careful not to infer things about the model that aren’t there (facts are not highly localized in one layer).
So, in the context of this post, if we do find things that look like search, I agree that we should make specific statements about internal structures as well as find ways to validate those statements/hypotheses. However, let’s keep in mind that those structures are likely not exactly what we model them to be (though we can still learn from them).
Agreed. It’s the same principle by which people are advised to engage in plan-making even if any specific plan they will invent will break on contact with reality; the same principle that underlies “do the math, then burn the math and go with your gut”.
While any specific model is likely to be wrong, trying to derive a consistent model gives you valuable insights into what a consistent model would look like at all, and builds model-building skills. What specific externally-visible features of the system do you need to explain? How much complexity is required to do so? How does the process that created the system you’re modeling interact with its internals? How does the former influence the relative probabilities of different internal designs? How would you be able to distinguish one internal structure from another?
Thinking about concrete models forces you to, well, solidify your understanding of the subject matter into a concrete model — and that’s non-trivial in itself.
I’d done that exercise with a detailed story of AI agency development a few months ago, and while that model seems quite naive and uninformed to me now, having built it significantly improved my ability to understand others’ models, see where they connect and what they’re meant to explain.
(Separately, this is why I agree with e.g. Eliezer that people should have a concrete, detailed plan not just for technical alignment, but for how they’ll get the friendly AGI all the way to deployment and AI-Risk-amelioration in the realistic sociopolitical conditions. These plans won’t work as written, but they’ll orient you and give you an idea of what it even looks like to be succeeding at this task vs. failing.)
Thank you so much for the excellent and insightful post on mechanistic models, Evan!
My hypothesis is that the difficulty of finding mechanistic models that consistently make accurate predictions is likely due to the agent-environment system’s complexity and computational irreducibility. Such agent-environment interactions may be inherently unpredictable “because of the difficulty of pre-stating the relevant features of ecological niches, the complexity of ecological systems and [the fact that the agent-ecology interaction] can enable its own novel system states.”
Suppose that one wants to consistently make accurate predictions about a computationally irreducible agent-environment system. In general, the most efficient way to do so is to run the agent in the given environment. There are probably no shortcuts, even via mechanistic models.
For dangerous AI agents, an accurate simulation box of the deployment environment would be ideal for safe empiricism. This is probably intractable for many use cases of AI agents, but computational irreducibility implies that methods other than empiricism are probably even more intractable.
Please read my post “The limited upside of interpretability” for a detailed argument. It would be great to hear your thoughts!
You write that even if the mechanistic model is wrong, if it “has some plausible relationship to reality, the predictions that it makes can still be quite accurate.” I think that this is often true, and true in particular in the case at hand (explicit search vs. not). However, I think there are many domains where this is false, where there is a large range of mechanistic models which are plausible but make very false predictions. This depends roughly on how much the details of the prediction vary with the details of the mechanistic model. In the explicit search case, it seems like many other plausible models for how RL agents might mechanistically function imply agent-ish behavior, even if the model is not primarily using explicit search. However, this is because the agent must accomplish the training objective, which heavily constrains the space of possible behaviors. In questions where the prediction space is less constrained to begin with (e.g., questions about how the far future will go), different “mechanistic” explanations (for example, thinking that the far future will be controlled by a human superintelligence vs. an alien superintelligence vs. evolutionary dynamics) imply significantly different predictions.
Yes, I agree—this is why I say: