(Related: Inaccessible Information, What does the universal prior actually look like?, Learning the prior)
Fitting a neural net implicitly uses a “wrong” prior. This makes neural nets more data hungry and makes them generalize in ways we don’t endorse, but it’s not clear whether it’s an alignment problem.
After all, if neural nets are what works, then both the aligned and unaligned AIs will be using them. It’s not clear if that systematically disadvantages aligned AI.
Unfortunately I think it’s an alignment problem:
I think the neural net prior may work better for agents with certain kinds of simple goals, as described in Inaccessible Information. The problem is that the prior mismatch may bite harder for some kinds of questions, and some agents simply never need to answer those hard questions.
I think that Solomonoff induction generalizes catastrophically because it becomes dominated by consequentialists who use better priors.
In this post I want to try to build some intuition for this problem, and then explain why I’m currently feeling excited about learning the right prior.
Indirect specifications in universal priors
We usually work with very broad “universal” priors, both in theory (e.g. Solomonoff induction) and in practice (deep neural nets are a very broad hypothesis class). For simplicity I’ll talk about the theoretical setting in this section, but I think the points apply equally well in practice.
The classic universal prior is a random output from a random stochastic program. We often think of the question “which universal prior should we use?” as equivalent to the question “which programming language should we use?” but I think that’s a loaded way of thinking about it — not all universal priors are defined by picking a random program.
A universal prior can never be too wrong — a prior P is universal if, for any other computable prior Q, there is some constant c such that, for all x, we have P(x) > c Q(x). That means that given enough data, any two universal priors will always converge to the same conclusions, and no computable prior will do much better than them.
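The domination condition can be made concrete with a toy example. The sketch below is not a universal prior; it is a finite Bayesian mixture over three hypothetical predictors (the names `mixture_log_loss`, `predictors`, and `weights` are my own, not from any library). It only illustrates the one property used above: if P(x) ≥ w Q(x), then the cumulative log loss of P exceeds that of Q by at most log2(1/w) bits, no matter how long the sequence is.

```python
import math
import random


def mixture_log_loss(predictors, weights, bits):
    """Return (mixture loss, per-predictor losses), all in bits.

    predictors: functions mapping the history (a list of bits) to P(next bit = 1),
                assumed to return values strictly between 0 and 1.
    weights:    prior weights over the predictors, summing to 1.
    """
    posterior = list(weights)
    mix_loss = 0.0
    losses = [0.0] * len(predictors)
    history = []
    for b in bits:
        # Probability each predictor assigns to the bit that actually occurred.
        likes = [p(history) if b == 1 else 1.0 - p(history) for p in predictors]
        p_mix = sum(w * l for w, l in zip(posterior, likes))
        mix_loss += -math.log2(p_mix)
        for i, l in enumerate(likes):
            losses[i] += -math.log2(l)
        # Bayes update of the weights on the observed bit.
        posterior = [w * l / p_mix for w, l in zip(posterior, likes)]
        history.append(b)
    return mix_loss, losses


# Hypothetical data: a coin that lands 1 about 90% of the time.
rng = random.Random(0)
bits = [1 if rng.random() < 0.9 else 0 for _ in range(200)]

# Three hypothetical predictors; the "good" one starts with only 1% of the weight.
predictors = [lambda h: 0.5, lambda h: 0.9, lambda h: 0.1]
weights = [0.98, 0.01, 0.01]

mix_loss, losses = mixture_log_loss(predictors, weights, bits)
for w, loss in zip(weights, losses):
    print(f"weight={w}: regret={mix_loss - loss:.2f} bits, bound={math.log2(1 / w):.2f} bits")
```

The bound is a constant, which is why it is reassuring in the limit of lots of data and much less reassuring when the constant is comparable to the total information in the data we actually have.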
Unfortunately, universality is much less helpful in the finite data regime. The first warning sign is that our “real” beliefs about the situation can appear in the prior in two different ways:
Directly: if our beliefs about the world are described by a simple computable predictor, they are guaranteed to appear in a universal prior with significant weight.
Indirectly: the universal prior also “contains” other programs that are themselves acting as priors. For example, suppose I use a universal prior with a terribly inefficient programming language, in which each character needs to be repeated 10 times in order for the program to do anything non-trivial. This prior is still universal, but it’s reasonably likely that the “best” explanation for some data will be to first sample a really simple interpreter for a better programming language, and then draw a uniformly random program in that better programming language.
(There isn’t a bright line between these two kinds of posterior, but I think it’s extremely helpful for thinking intuitively about what’s going on.)
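To see roughly when the indirect route wins in the repeated-character example, here is a back-of-the-envelope sketch. It assumes (my simplification, not something the argument depends on) that a program costs the same number of characters in the better language as it would "naturally", that the interpreter takes `interpreter_chars` characters before the repetition penalty, and that only the interpreter pays the penalty. All function names are hypothetical.

```python
def direct_cost(n_chars: int, repeat: int = 10) -> int:
    """Characters needed to write an n-character program directly in the
    inefficient language, where every character is repeated `repeat` times."""
    return repeat * n_chars


def indirect_cost(n_chars: int, interpreter_chars: int, repeat: int = 10) -> int:
    """Characters needed to write an interpreter for a better language (paying
    the repetition penalty once), followed by the program at its normal length."""
    return repeat * interpreter_chars + n_chars


def crossover(interpreter_chars: int, repeat: int = 10) -> int:
    """Smallest program length at which the indirect route is strictly shorter.

    repeat * k + n < repeat * n   <=>   n > repeat * k / (repeat - 1)
    """
    return (repeat * interpreter_chars) // (repeat - 1) + 1


# With a hypothetical 90-character interpreter, any program longer than
# 100 characters is cheaper to specify indirectly.
k = 90
for n in (50, 100, 101, 1000):
    print(n, direct_cost(n), indirect_cost(n, k))
print("crossover at", crossover(k), "characters")
```

Since prior probability falls off exponentially in description length, even a modest saving in characters translates into a large factor in prior weight, so past the crossover point the indirect explanation dominates.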
Our “real” belief is more like the direct model — we believe that the universe is a lawful and simple place, not that the universe is a hypothesis of some agent trying to solve a prediction problem.
Unfortunately, for realistic sequences and conventional universal priors, I think that indirect models are going to dominate. The problem is that “draw a random program” isn’t actually a very good prior, even if the programming language is OK. If I were an intelligent agent, even if I knew nothing about the particular world I lived in, I could do a lot of a priori reasoning to arrive at a much better prior.
The conceptually simplest example is “I think therefore I am.” Our hypotheses about the world aren’t just arbitrary programs that produce our sense experiences: we restrict attention to hypotheses that explain why we exist and for which it matters what we do. This rules out the overwhelming majority of programs, allowing us to assign significantly higher prior probability to the real world.
I can get other advantages from a priori reasoning, though they are a little bit more slippery to talk about. For example, I can think about what kinds of specifications make sense and really are most likely a priori, rather than using an arbitrary programming language.
The upshot is that an agent who is trying to do something, and has enough time to think, actually seems to implement a much better prior than a uniformly random program. If the complexity of specifying such an agent is small relative to the prior improbability of the sequence we are trying to predict, then I think the universal prior is likely to pick out the sequence indirectly by going through the agent (or else in some even weirder way).
I make this argument in the case of Solomonoff induction in What does the universal prior actually look like? I find that argument pretty convincing, although Solomonoff induction is weird enough that I expect most people to bounce off that post.
I make this argument in a much more realistic setting in Inaccessible Information. There I argue that if we e.g. use a universal prior to try to produce answers to informal questions in natural language, we are very likely to get an indirect specification via an agent who reasons about how we use language.
Why is this a problem?
I’ve argued that the universal prior learns about the world indirectly, by first learning a new better prior. Is that a problem?
To understand how the universal prior generalizes, we now need to think about how the learned prior generalizes.
The learned prior is itself a program that reasons about the world. In both of the cases above (Solomonoff induction and neural nets) I’ve argued that the simplest good priors will be goal-directed, i.e. will be trying to produce good predictions.
I have two different concerns with this situation, both of which I consider serious:
Bad generalizations may disadvantage aligned agents. The simplest version of “good predictions” may not generalize to some of the questions we care about, and may put us at a disadvantage relative to agents who only care about simpler questions. (See Inaccessible Information.)
Treacherous behavior. Some goals might be easier to specify than others, and a wide range of goals may converge instrumentally to “make good predictions.” In this case, the simplest programs that predict well might be trying to do something totally unrelated, and when they no longer have instrumental reasons to predict well (e.g. when their predictions can no longer be checked) they may do something we regard as catastrophic.
I think it’s unclear how serious these problems are in practice. But I think they are huge obstructions from a theoretical perspective, and I think there is a reasonable chance that this will bite us in practice. Even if they aren’t critical in practice, I think that it’s methodologically worthwhile to try to find a good scalable solution to alignment, rather than having a solution that’s contingent on unknown empirical features of future AI.
Learning a competitive prior
Fundamentally, I think our mistake was building a system that uses the wrong universal prior, one that fails to really capture our beliefs. Within that prior, there are other agents who use a better prior, and those agents are able to outcompete and essentially take over the whole system.
I’ve considered lots of approaches that try to work around this difficulty, taking for granted that we won’t have the right prior and trying to somehow work around the risky consequences. But now I’m most excited about the direct approach: give our original system the right prior so that sub-agents won’t be able to outcompete it.
This roughly tracks what’s going on in our real beliefs, and why it seems absurd to us to infer that the world is a dream of a rational agent: why think that the agent will assign higher probability to the real world than the “right” prior does? (The simulation argument is actually quite subtle, but I think that after all the dust clears this intuition is basically right.)
What’s really important here is that our system uses a prior which is competitive, as evaluated by our real, endorsed (inaccessible) prior. A neural net will never be using the “real” prior, since it’s built on a towering stack of imperfect approximations and is computationally bounded. But it still makes sense to ask for it to be “as good as possible” given the limitations of its learning process. We want to avoid the situation where the neural net is able to learn a new prior which predictably outperforms the outer prior. In that situation we can’t just blame the neural net, since it’s demonstrated that it’s able to learn something better.
In general, I think that competitiveness is a desirable way to achieve stability — using a suboptimal system is inherently unstable, since it’s easy to slip off of the desired equilibrium to a more efficient alternative. Using the wrong prior is just one example of that. You can try to avoid slipping off to a worse equilibrium, but you’ll always be fighting an uphill struggle.
Given that, I think that finding the right universal prior should be “plan A.” The real question is whether that’s tractable. My current view is that it looks plausible enough (see Learning the prior for my current best guess about how to approach it) that it’s reasonable to focus on for now.
Better priors as a safety problem was originally published in AI Alignment on Medium, where people are continuing the conversation by highlighting and responding to this story.
To the extent that we instinctively believe or disbelieve this, it’s not for the right reasons: natural selection didn’t have any evidence to go on. At most, that instinct is a useful workaround for the existential dread glitch.
Assume that there is a real prior (I like to call this programming language Celestial), and that it can be found from first principles together with an example universe to work with. Then I wouldn’t be surprised if we receive more weight indirectly than directly. After all:
Our laws of physics may be simple, but us seeing a night sky devoid of aliens suggests that it takes quite a few bits to locate us in time and space and improbability.
An anthropic bias would circumvent this, and agents living in the multiverse would be incentivized to implement it: The universes thereby promoted are particularly likely to themselves simulate the multiverse and act on what they see, and those are the only universes vulnerable to the agent’s attack.
Our universe may be particularly suited to simulate the multiverse in vulnerable ways, because of our quantum computers. All it takes is that we run a superposition of all programs, rely on a mathematical heuristic that tells us that almost all of the amplitudes cancel out, and get tricked by the agent employing the sort of paradox of self-reference that mathematical heuristics tend to be wrong on.
If the quirks of chaos theory don’t force the agent to simulate all of our universe to simulate any of it, then at least the only ones of us that have to worry about being simulated in detail in preparation of an attack are AI/AI safety researchers :P.
To the extent that we believe this correctly, it’s for the same reasons that we are able to do math and philosophy correctly (or at least more correctly than chance :) despite natural selection not caring about it much. It’s the same reason that you can correctly make arguments like the one in your comment.
Summary for the Alignment Newsletter (also includes a summary for Learning the prior):
Planned opinion:
This will probably go out in the newsletter 9 days from now instead of the next one, partially because I have two things to highlight and I’d rather send them out separately, and partially because I’m not confident my summary / opinion are correct and I want to have more time for people to point out flaws.
I didn’t quite follow this bit. In particular, I’m not sure which of “real world” and “right prior” refers to an actual physical world, and which refers to a simulation or dream (or if that’s even the right way to distinguish between the two).
I think this is saying something about having a prior over base-level universes or over simulated (or imagined) universes. And I think maybe it (and the surrounding context) is saying that it’s more useful to have a prior that you’re in a “real” universe (because otherwise you maybe don’t care what happens). But I’m not confident of that interpretation.
Is that on the right track?
I too was confused by that bit. I think the reason why the hypothesis that the world is a dream seems absurd has very little to do with likelihood ratios and everything to do with heuristics like “don’t trust things that sound like what a crazy person, drug-addled person, or mystic would say.” I get the sense that Paul thinks the “right” prior assigns low credence to being in a simulation, but that seems false to me. Paul if you read this I’d love to hear your thoughts on the simulation argument.
I think that under the counting measure, the vast majority of people like us are in simulations (ignoring subtleties with infinities that make that statement meaningless).
I think that under a more realistic measure, it’s unclear whether or not most people like us are in simulations.
Those statements are unrelated to what I was getting at in the post though, which is more like: the simulation argument rests on us being the kind of people who are likely to be simulated. We don’t think that everyone should believe they are in a simulation on the grounds that the simulators are more likely to simulate realistic-looking worlds than reality is to produce realistic-looking worlds; that seems absurd.
The whole thing is kind of a complicated mess and I wanted to skip it by brushing aside the simulation argument. Maybe I should have just not mentioned it at all, given that the simulation argument makes such a mess of it. I don’t expect to be able to get clarity in this thread either :)
It’s not the hypothesis that’s absurd, it’s this particular argument.
What sorts of measures do you have in mind when you say “...a more realistic measure”? A simplicity measure will still yield the result that most people like us are in simulations, I think.
I interpret you as saying that P(ourdata|simulated) < P(ourdata|not-simulated). This is plausible, but debatable—e.g. the joke that Elon Musk is probably in a simulation because he’s such a special person living such a crazy life. Also more seriously the arguments that we are at a special time in history, precisely the time that you would expect most simulations to be of. Also one might think that most non-simulated minds exist in some sort of post-singularity world, whereas plausibly most simulated minds exist in what appears to be a pre-singularity world...