Prediction can be Outer Aligned at Optimum
This post argues that many prediction tasks are outer aligned at optimum. In particular, I think that the malignity of the universal prior should be treated as an inner alignment problem rather than an outer alignment problem. The main argument is entirely in the first section; treat the rest as appendices.
Short argument
In Evan Hubinger’s Outer Alignment and Imitative Amplification, outer alignment at optimum is defined as follows:
a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals
In the (highly recommended) Overview of 11 proposals for building safe advanced AI, this is used to argue that:
Imitative amplification is probably outer aligned (because it emulates HCH, as explained in the first linked post, and in section 2 of the overview)
Microscope AI (section 5) and STEM AI (section 6) are probably outer misaligned, because they rely on prediction, and optimal prediction is characterised by Bayesian inference on the universal prior. This is a problem, since the universal prior is probably malign (see Mark Xu’s explanation and Paul Christiano’s original arguments).
I disagree with this, because I think that both imitative amplification and STEM AI would be outer aligned at optimum, and that some implementations of microscope AI would be outer aligned at optimum (see the next section for cases where it might not be). This is because optimal prediction isn’t necessarily outer misaligned.
The quickest way to see this is to note that everything is prediction. Imitative amplification relies on imitation learning, and imitating a human via imitation learning is equivalent to predicting what they’ll do (and then doing it). Thus, if microscope AI or STEM AI is outer misaligned due to relying on prediction, imitation learning is just as misaligned.
However, I don’t think that relying on prediction makes a method outer misaligned at optimum. So what’s wrong with the argument about the universal prior? Well, note that you don’t get perfect prediction on a task by doing ideal Bayesian inference on any universal prior. Different universal priors yield different predictions, so some of them are going to be better than others. Now, define optimal performance as requiring “the model to always have optimal loss on all data points that it ever encounters”, as Evan does in this footnote. Even if almost all universal priors are malign, I claim that the prior that actually does best on all data it encounters is going to be aligned. Optimising for power requires sacrificing prediction performance, so the misaligned priors are going to be objectively worse. For example, the model that answers every STEM question correctly doesn’t have any wiggle room to optimise against us, because any variation in its answers would make one of them incorrect.
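To make the "no wiggle room" point concrete, here's a toy sketch (my own illustration, assuming log loss as the prediction loss): any model that deviates from the ground-truth answers on even one question gets strictly higher loss, so an optimal predictor has no slack left over to spend on influencing us.

```python
import numpy as np

# Toy illustration (not from the post): with log loss, deviating from the
# ground truth on even one question strictly increases the loss, so an
# optimal predictor has no slack left to optimise against us.
rng = np.random.default_rng(0)
true_answers = rng.integers(0, 2, size=1000).astype(float)  # ground-truth 0/1 answers

def log_loss(predicted_prob_of_1, labels):
    p = np.clip(predicted_prob_of_1, 1e-9, 1 - 1e-9)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

honest = true_answers.copy()        # always predicts the correct answer
scheming = true_answers.copy()
scheming[0] = 1.0 - scheming[0]     # "spends" one answer on sending a message

print(log_loss(honest, true_answers))    # ~0: optimal
print(log_loss(scheming, true_answers))  # strictly higher: no longer optimal
```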
The problem we face is that it’s really hard to tell whether our model actually is optimal. This is because consequentialists in a malign prior can intentionally do well on the training and validation data, and only sacrifice prediction performance during deployment. Importantly, this problem applies equally strongly for imitation learning as for other types of prediction.
The most important consequence of all this is that the universal prior’s malignity should be classified as a problem with inner alignment, instead of outer alignment. I think this is good and appropriate, because concerns about consequentialists in the universal prior are very similar to concerns about mesa optimisers being found by gradient descent. In both cases, we’re worried that:
the inductive biases of our algorithm…
will favor power-seeking consequentialists…
who correctly predict the training data in order to stick around…
but eventually perform a treacherous turn after the distributional shift caused by deployment.
Online learning may be misaligned
However, one particular case where prediction may be outer misaligned is when models are asked to predict the future, and the prediction can affect what the true answer is.
For example, the stereotypical Solomonoff induction setup is to place a camera somewhere on Earth, and let the inductor predict all future bits that the camera will observe. If we implemented this with a neural network and changed our actions based on what the model predicted, I’m not confident that the result would be outer aligned at optimum. However, this is unrelated to the malignity of the universal prior – instead, it’s because we might introduce strange incentives by changing our actions based on the AI’s predictions. For example, the AI might get the lowest loss if it systematically reports predictions that cause us to make the world as predictable as possible (and if we’re all dead, the world is pretty predictable…). For more on this, see Abram’s parable (excluding section 9, which is about inner alignment) and associated discussion.
To see why this concern is unrelated to the malignity of the universal prior, note that there are two cases (that mostly depend on our definition of “at optimum”):
For each question, there is only a single prediction that leads to the lowest loss.
For example, if we assume that the model has some degree of uncertainty about the future, it is likely that some predictions will affect the world in a way that makes them very likely to come true, whereas other self-fulfilling prophecies may only be somewhat likely to come true.
In this case, my argument from above applies: the model’s behavior is fully specified by what the right answer is; the best prior for the situation will be selected; and there’s no room for (additional) malignity without making the prediction worse.
For some questions, there are multiple predictions that lead to optimal performance.
In this case, the model’s behavior isn’t fully specified by saying that it performs optimally.
However, Evan’s definition says that “a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals”. Thus, we can tell whether the model is outer aligned at optimum by checking all combinations of optimal predictions, without having to consider the inductive biases of the model.
In practice, I don’t expect this type of concern to matter for STEM AI, but it may be a concern with some implementations of microscope AI.
As an aside, note that the definition of outer alignment at optimum only specifies the model’s input-output behavior, whereas for microscope AI, we care about the model’s internal structure. Thus, it’s unclear how to apply the definition to microscope AI. One option is to apply it to the system as a whole – including the humans trying to make the predictions. Alternatively, microscope AI may be such a unique proposal that the standard definition just isn’t useful.
How to define optimal performance?
While I think that my short argument is broadly correct, it does sweep a lot of ambiguities under the rug, because it’s surprisingly annoying to construct a rigorous definition of outer alignment at optimum. The problem is that – in order to define optimal performance – we need both a loss function and a distribution of data to evaluate the loss function on. In many cases, it’s quite difficult to pinpoint what the distribution of data is, and exactly how to apply the loss function to it. Specifically, it depends on how the correct answer is labelled during training and deployment.
How is the correct answer labelled?
As far as I can tell, there are at least four cases here:
Mechanistic labelling: In this setting, the model’s success can be verified in code. For example, when a model learns to play chess, it’s easy to check whether it won or not. Some versions of STEM AI could be in this category.
Real-world labelling: In this setting, the model is interacting with or trying to predict the real world, such that success or failure is almost mechanically verifiable once we see how the world reacts. An example of this is an AI tasked to predict all bits observed by a single camera. Some versions of STEM AI or microscope AI could be in this category.
Human labelling: These are cases where data points are labelled by humans in a controlled setting. Often there are some human(s), e.g. an expert labeller or a group of MTurkers, who look at the input and are tasked with returning some output. This includes e.g. training on ImageNet, and the kind of imitation learning that’s necessary for imitative amplification.
Unlabelled: These are cases where the AI learns the distribution of a big pile of pre-existing data in an unsupervised manner. The distribution is then often used for purposes quite different from the training setting. GPT-3 and DALL-E are both examples of this type of learning. Some types of STEM AI or microscope AI could be in this category.
Each of these offers different options for defining optimal performance.
With mechanistic labelling, optimal performance is unambiguous.
With real-world labelling, you can define optimal performance as doing whatever gives the network optimal reward, since each of the model’s actions eventually leads to some reward.
There’s some question about whether you should define optimal performance as always predicting the thing that actually happens, or whether you should assume that the model has some particular uncertainty about the world, and define optimal performance as returning the best probability distribution, given its uncertainty. I don’t think this matters very much.
If implemented in the wrong way, these models are vulnerable to self-fulfilling prophecies. Thus, they may be outer misaligned at optimum, as mentioned in the section on online learning above.
With human labelling, we rarely get an objective answer during deployment, since we don’t ask humans to label data once training is over. However, we can define optimal performance according to what the labeller would have done, if they had gotten some particular input.
In this case, the model must return some probability distribution over labels, since there’s presumably some randomness in how the labeller acts.
However, if we want to, we can still assume that the AI’s knowledge of the outside world is completely fixed, and ask about the (impossible) counterfactual of what the human would have answered if they’d gotten a different input than they did.
For unsupervised learning, this option isn’t available, because there’s no isolated human who we can separate from the rest of the world. If GPT-3 is presented with the prompt “Hi I’m Lukas and…”, it cannot treat its input as coming from some fixed human(s) H who react to their input as H(“Hi I’m Lukas and...”). Instead, most of GPT-3’s job is to update on the fact that the current source is apparently called Lukas and is trying to introduce themselves, and on whatever that implies about the current source and about the world at large[1]. This means that for human-labelled data, we can assume that the world is fixed (or that the model’s uncertainty about the world is fixed), and only think about varying the input to the labeller. However, for unsupervised learning, we can’t hold the world fixed, because we need to find the most probable world where someone produced that particular input.
Extending the training distribution
As a consequence, when defining optimal performance for unsupervised learning, we need to define a full distribution of all possible inputs and outputs. GPT-3 was trained on internet text, but the internet text we have is very small compared to all prompts you could present to GPT-3. To define optimal performance, we therefore need to define a hypothetical process that represents much more internet text. Here are a few options for doing that:
- Choose some universal prior as a measure over distributions; condition it on our finite training data; and use the resulting distribution as our ground truth distribution. As our universal prior, we could use the prior defined by some simple programming language (e.g. Python) or the “true” prior that our universe is sampled from (if that’s coherent).
Due to the ordinary arguments about the universal prior being malign, this wouldn’t be outer aligned at optimum. Since this definition would mean that almost nothing is outer aligned, it seems like a bad definition.
- Define the ground truth of correct generalisation as the way that humans would generalise, if they became really good at predicting the training text.
The problem with this definition is that we want to track the alignment properties of algorithms even as they reach far beyond human performance.
One option for accessing superhuman performance with human-like generalisation intuitions is to do something like Paul Christiano’s Learning the prior.
While some variant of this could be good, becoming confident that it’s good would involve solving a host of AI alignment problems along the way, which I unfortunately won’t do in this post.
- Use quantum randomness as our measure over distributions. More specifically, choose some point in the past (e.g. when Earth was created 4 billion years ago, or when the internet was created 40 years ago), and then consider all possible futures from that moment, using quantum fluctuations as the only source of “randomness”. Use the Born rule to construct a measure over these worlds. (If you prefer Copenhagen-like theories, this will be a probability measure. If you prefer multiverse theories, this will be a measure over close-by Everett branches.)
Then, exclude all worlds that in the year 2020 don’t contain a model with GPT-3’s architecture that was trained on GPT-3’s training data. Most of the remaining worlds will have some unobserved validation set that the researchers didn’t use during training. We can then define optimal performance as the distribution over all these validation sets, weighted by our quantum measure over the worlds they show up in.
As far as I can tell, this is mostly well defined, and seems to yield sensible results. Since GPT-3’s training data contains so many details of our world, every world that contains a similar dataset will be very similar to our world. Lots of minor details will presumably vary, though, which means that the unobserved data should contain a wide and fair distribution.
There are some subtleties about how we treat worlds where GPT-3 was trained multiple times on the same data, or how we treat different sizes of validation sets, etc; but I don’t think it matters much.
I’m a bit wary of how contrived this definition is, though. We would presumably have wanted some way of defining counterfactuals even if quantum mechanics hadn’t offered this convenient splitting mechanism, so there ought to be some less hacky way of doing it[2].
If we wanted to, I think we could use a similar definition also in situations with real-world labelling or human labelling. I.e., we could require even an optimal model to be uncertain about everything that wasn’t universal across all Everett branches containing its training data. The main concern about this is that some questions may be extremely unlikely to appear in training data in the year 2020 (e.g. a question containing the correct factorisation of RSA-2048), in which case being posed that question may move the most-likely-environment to some very strange subset of worlds. I’m unsure whether this would be a problem or not.
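As a rough formalisation of the quantum option (my own notation; the branch measure and the choice of validation data are only gestured at above), the ground-truth distribution over a held-out data point $x$ would be something like

$$P(x) \;=\; \frac{1}{Z} \sum_{w \in W_{2020}} \mu_{\mathrm{Born}}(w)\, P_w(x),$$

where $W_{2020}$ is the set of branches that, in 2020, contain a model with GPT-3’s architecture trained on GPT-3’s training data, $\mu_{\mathrm{Born}}$ is the Born-rule measure over branches descending from the chosen starting point, $P_w$ is the empirical distribution of the unobserved validation data in branch $w$, and $Z$ normalises over the included branches.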
A note about simulations
Finally, since these definitions refer to what the AI would actually encounter in our world, I want to briefly mention an issue with simulations. We don’t only need to worry about the possibility that a Solomonoff inductor thinks its input is being simulated – we should also consider the possibility that we are in a simulation.
Most simulations are probably short-lived, since simulating low-tech planets is so much cheaper than simulating the colonisation of galaxies. Thus, if we’re total consequentialists, the long-term impact we can have in a simulation is negligible compared to the impact we can have in the real world (unless the sheer number of simulations outweighs the long-term impact we can have, see How the Simulation Argument Dampens Future Fanaticism).
As a consequence of this, the only real alignment problem introduced by simulations is if an AI assigns substantial probability to being in a simulation despite actually being in the real world. This would be bad because – if the AI predicts things as if it was in a simulation – the consequentialists that the AI believes control the simulation will have power over what predictions it makes, which they can use to gain power in the real world. This is just a variation of the universal prior being malign, where the consequentialists are hypothesized to simulate all of Earth instead of just the data that the AI is trying to predict.
As far as practical consequences go, I think this should be treated the same as the more general problem of the universal prior being malign. Thus, I’d like to categorise it as a problem with inner alignment; and I’d like to assume that an AI that’s outer aligned at optimum would act like it’s not in a simulation, if it is in fact not in a simulation.
This happens by default if our chosen definition of optimal performance treats being-in-a-simulation as a fixed fact about its environment – that the AI is expected to know – and not as a source of uncertainty. I think my preferred solutions above capture this by default[3]. For any solution based on how humans generalise, though, it would be important that the humans condition on not being in a simulation.
Thanks to Hjalmar Wijk and Evan Hubinger for helpful comments on earlier versions.
Notes
[1] One way to frame this is with Pearl’s do-calculus. Say that the input is a random variable X and the output is a random variable Y. We could then define optimal human-labelled performance as learning the distribution p(Y=y | do(X=x)), whereas unsupervised learning is trying to learn the entire joint distribution p(X, Y) in order to answer p(Y=y | X=x). For GPT-3, learning p(Y=y | do(X=x)) would correspond to guessing what a random human would say if they learned that they’d just typed “Hi I’m Lukas and…”, which would be very strange.
[2] One option is to partition the world into “macrostates” (e.g. a specification of where all humans are, what they’re saying, what the weather is, etc.) and “microstates” (a complete specification of the location and state of all elementary particles), where each macrostate is consistent with lots of microstates. Then, we can specify a year, and assume that we know the macrostate of the world at the beginning of the year but are uncertain about the microstate. If we then wait long enough, the uncertainty in microstate would eventually induce variation in macrostates, which we could use to define a distribution over data. I think this would probably yield the same results as the quantum definition, but the distinction between macrostates and microstates is a lot more vague than our understanding of quantum mechanics.
[3] This is the reason I wrote that we should exclude all worlds that “in the year 2020 don’t contain a model with GPT-3’s architecture that was trained on GPT-3’s training data”. Without the caveat about 2020, we would accidentally include worlds where humanity’s descendants decide to simulate their ancestors.
This is unsatisfying to me. First you say that we can’t define optimum in the obvious way because then very few things would be outer aligned, and then you say we should define optimum in such a way that the only way to be outer aligned is to assume you aren’t in a simulation. (How else would we get an AI that acts like it’s not in a simulation, if it is in fact not in a simulation? You can’t tell whether you are in a simulation or not, by definition, so the only way for such an AI to exist is for it to always act like it’s not in a simulation, i.e. to assume it isn’t.) An AI that assumes it isn’t in a simulation seems like a defective AI to me, so it’s weird to build that into the definition of outer alignment.
It’s possible I’m misunderstanding you though!
Things I believe about what sort of AI we want to build:
It would be kind of convenient if we had an AI that could help us do acausal trade. If assuming that it’s not in a simulation would preclude an AI from doing acausal trade, that’s a bit inconvenient. However, I don’t think this matters for the discussion at hand, for reasons I describe in the final array of bullet points below.
Even if it did matter, I don’t think that the ability to do acausal trade is a deal-breaker. If we had a corrigible, aligned, superintelligent AI that couldn’t do acausal trade, we could ask it to scan our brains, then compete through any competitive period on Earth / in space, and eventually recreate us and give us enough time to figure out this acausal trade thing ourselves. Thus, for practical purposes, an AI that assumes it isn’t in a simulation doesn’t seem defective to me, even if that means it can’t do acausal trade.
Things I believe about how to choose definitions:
When choosing how to define our terms, we should choose based on what abstractions are most useful for the task at hand. For the outer-alignment-at-optimum vs inner alignment distinction, we’re trying to choose a definition of “optimal performance” such that we can separately:
Design an intent-aligned AI out of idealised training procedures that always yield “optimal performance” on some metric. If we successfully do this, we’ve solved outer alignment.
Figure out a training procedure that produces an AI that actually does very well on the chosen metric (sufficiently well to be aligned, even if it doesn’t achieve absolute optimal performance). If we do this, we’ve solved inner alignment.
Things I believe about what these candidate definitions would imply:
For every AI-specification built with the abstraction “Given some finite training data D, the AI predicts the next data point X according to how common it is that X follows D across the multiverse”, I think that AI is going to be misaligned (unless it’s trained with data that we can’t get our hands on, e.g. infinite in-distribution data), because of the standard universal-prior-is-misaligned reasons. I think this holds true even if we’re trying to predict humans like in IDA. Thus, this definition of “optimal performance” doesn’t seem useful at all.
For an AI-specification built with the abstraction “Given some finite training data D, the AI predicts the next data point X according to how common it is that X follows D on Earth if we aren’t in a simulation”, I think it probably is possible to build aligned AIs. Since it also doesn’t seem impossible to train AIs to do something like this (i.e. we haven’t just moved the impossibility to the inner alignment part of the problem), it seems like a pretty good definition of “optimal performance”.
Surprisingly, I think it’s even possible to build AIs that do assign some probability to being in a simulation out of this. E.g. we could train the AI via imitation learning to imitate me (Lukas). I assign a decent probability to being in a simulation, so a perfect Lukas-imitator would also assign a decent probability to being in a simulation. This is true even if the Lukas-imitator is just trying to imitate the real-world Lukas as opposed to the simulated Lukas, because real-world Lukas assigns some probability to being simulated, in his ignorance.
I’m also open to other definitions of “optimal performance”. I just don’t know any useful ones other than the ones I mention in the post.
Thanks, this is helpful.
--You might be right that an AI which assumes it isn’t in a simulation is OK—but I think it’s too early to conclude that yet. We should think more about acausal trade before concluding it’s something we can safely ignore, even temporarily. There’s a good general heuristic of “Don’t make your AI assume things which you think might not be true” and I don’t think we have enough reason to violate it yet.
--You say
Isn’t that exactly the point of the universal-prior-is-misaligned argument? The whole point of the argument is that this abstraction/specification (and related ones) is dangerous. So… I guess your title made it sound like you were teaching us something new about prediction (as in, prediction can be outer aligned at optimum), when really you are just arguing that we should change the definition of outer-aligned-at-optimum, and your argument is that the current definition makes outer alignment too hard to achieve? If this is a fair summary of what you are doing, then I retract my objections I guess, and reflect more.
Yup.
I mean, it’s true that I’m mostly just trying to clarify terminology. But I’m not necessarily trying to propose a new definition – I’m saying that the existing definition already implies that malign priors are an inner alignment problem, rather than an issue with outer alignment. Evan’s footnote requires the model to perform optimally on everything it actually encounters in the real world (rather than asking it to do as well as it can across the multiverse, given its training data); so that definition doesn’t have a problem with malign priors. And as Richard notes here, common usage of “inner alignment” refers to any case where the model performs well on the training data but is misaligned during deployment, which definitely includes problems with malign priors. And per Rohin’s comment on this post, apparently he already agrees that malign priors are an inner alignment problem.
Basically, the main point of the post is just that the 11 proposals post is wrong about mentioning malign priors as a problem with outer alignment. And then I attached 3 sections of musings that came up when trying to write that :)
Well, at this point I feel foolish for arguing about semantics. I appreciate your post, and don’t have a problem with saying that the malignity problem is an inner alignment problem. (That is zero evidence that it isn’t also an outer alignment problem though!)
Evan’s footnote-definition doesn’t rule out malign priors unless we assume that the real world isn’t a simulation. We may have good pragmatic reasons to act as if it isn’t, but I still think you are changing the definition of outer alignment if you think it assumes we aren’t in a simulation. But *shrug* if that’s what people want to do, then that’s fine I guess, and I’ll change my usage to conform with the majority.
Cool, seems reasonable. Here are some minor responses: (perhaps unwisely, given that we’re in a semantics labyrinth)
Idk, if the real world is a simulation made by malign simulators, I wouldn’t say that an AI accurately predicting the world is falling prey to malign priors. I would probably want my AI to accurately predict the world I’m in even if it’s simulated. The simulators control everything that happens anyway, so if they want our AIs to behave in some particular way, they can always just make them do that no matter what we do.
Fwiw, I think this is true for a definition that always assumes that we’re outside a simulation, but I think it’s in line with previous definitions to say that the AI should think we’re not in a simulation iff we’re not in a simulation. That’s just stipulating unrealistically competent prediction. Another way to look at it is that in the limit of infinite in-distribution data, an AI may well never be able to tell whether we’re in the real world or in a simulation that’s identical to the real world; but it would be able to tell whether we’re in a simulation with simulators who actually intervene, because it would see them intervening somewhere in its infinite dataset. And that’s the type of simulators that we care about. So definitions of outer alignment that appeal to infinite data automatically assume that AIs would be able to tell the difference between worlds that are functionally like the real world, and worlds with intervening simulators.
And then, yeah, in practice I agree we won’t be able to learn whether we’re in a simulation or not, because we can’t guarantee in-distribution data. So this is largely semantics. But I do think definitions like this end up being practically useful, because convincing the agent that it’s not individually being simulated is already an inner alignment issue, for malign-prior-reasons, and this is very similar.
Strongly agree that the universal prior being malign is an inner alignment concern. I think there’s actually a simpler argument: Solomonoff induction is a learning process, whereas the definition is about the loss at optimum. We could operationalize that as the limit of infinite data in this case. In the limit of infinite data, Solomonoff induction is not malign (in the way that is usually meant), independent of the base universal Turing machine that you are using. The reason we’re worried is that after seeing some finite amount of data, malign consequentialists might “take over” in a way that makes the infinite limit irrelevant.
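(My addition, for reference: the standard Solomonoff/Hutter convergence bound behind this says that if the bits are sampled from any computable distribution $\mu$, then

$$\sum_{t=1}^{\infty} \mathbb{E}_{\mu}\Big[\big(M(1 \mid x_{<t}) - \mu(1 \mid x_{<t})\big)^2\Big] \;\le\; c \cdot K(\mu),$$

where $M$ is the Solomonoff predictor, $K(\mu)$ is the complexity of $\mu$, and $c$ is a small constant. So the total expected prediction error is finite, the predictions converge to $\mu$'s with probability 1, and switching the universal Turing machine only changes the bound through an additive constant in $K(\mu)$.)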
That said, I think if you use the “outer aligned at optimum” you still expect that STEM AI and Microscope AI are misaligned, for a different reason Evan mentioned:
That is, if you write down a loss function like “do the best possible science”, then the literal optimal AI would take over the world and get a lot of compute and robots and experimental labs to do the best science it can do.
Note that in general I don’t really like the “outer aligned at optimum” definition, because as you note it’s not clear how you define optimal performance (I especially think the problem of “what distribution are we considering” is a hard one), though I don’t think that matters very much for the points discussed here.
I think this would be true for some ways to train a STEM AI with some loss functions (especially if it’s RL-like, can interact with the real world, etc.) but I think that there are some setups where this isn’t the case (e.g. things that look more like AlphaFold). Specifically, I think there exist some setups and some parsimonious definition of “optimal performance” such that optimal performance is aligned; and I claim that’s the more useful definition.
To be more concrete, do you think that an image classifier (trained with supervised learning) would have convergent instrumental goals that go against human interests? For image classifiers, I think there’s a natural definition of “optimal performance” that corresponds to always predicting the true label via the normal output channel; and absent inner alignment concerns, I don’t think a neural network trained on infinite data with SGD would ever learn anything less aligned than that. If so, it seems like the best definition of “at optimum” is the definition that says that the classifier is outer aligned at optimum.
Roughly speaking, you can imagine two ways to get safety:
Design the output channels so that unsafe actions / plans do not exist
Design the AI system so that even though unsafe actions / plans do exist, the AI system doesn’t take them.
I would rephrase your argument as “there are some types of STEM AI that are safe because of 1, it seems that given some reasonable loss function those AI systems should be said to be outer aligned at optimum”. This is also the argument that applies to image classifiers.
----
In the case where point 1 is literally true, I just wouldn’t even talk about whether the system is “aligned”; if it doesn’t have the possibility of an unsafe action, then whether it is “aligned” feels meaningless to me. (You can of course still say that it is “safe”.)
Note that in any such situation, there is no inner alignment worry. Even if the model is completely deceptive and wants to kill as many people as possible, by hypothesis we said that unsafe actions / plans do not exist, and the model can’t ever succeed at killing people.
----
A counterargument could be “okay, sure, some unsafe action / plan exists by which the AI takes over the world, but that happens only via side channels, not via the expected output channel”.
I note that in this case, if you include all the channels available to the AI system, then the system is not outer aligned at optimum, because the optimal thing to do is to take over the world and then always feed in inputs to which the outputs are perfectly known leading to zero loss.
Presumably what you’d want instead is to say something like “given a model in which the only output channel available to the AI system is ___, the optimal policy that only gets to act through that channel is aligned”. But this is basically saying that in the abstract model you’ve chosen, (1) applies; and again I feel like saying that this system is “aligned” is somehow missing the point of what “aligned” is supposed to mean.
As a concrete example, let’s take your image classifier example. 1. If we change the loss function so that dogs are labeled as cats and vice versa, is it still outer aligned at optimum (assuming the original was)? 2. What if it labeled humans as gorillas?
If you said yes to both, it’s still outer aligned at optimum, then hopefully you can see why the concept feels meaningless to me in this situation.
If you said no to both, these examples are no longer outer aligned at optimum, then I claim that the original loss function is also not outer aligned at optimum, because we could improve the categories used in the loss function (and it seems you agree that if the categories are worse then it is not outer aligned at optimum).
If you said yes to the first and no to the second, or yes to the second and no to the first, I have no idea what you mean by “outer aligned at optimum”.
----
Separately, even when you limit to a specific action space like classifying images, I could imagine that a literally optimal policy would still be able to take over the world given that action space (think of a policy that can predict and use the butterfly effect of classifying images), so I still don’t feel like it’s outer aligned at optimum. (Although perhaps this still doesn’t perform as well as the policy that magically knows all the answers and so can perfectly predict (what we label as) the class of any image.)
But this is not my real objection; my real objection is what I discussed above (that the concept “alignment” should not be tracking whether there does or does not exist an unsafe action in the AI’s action space).
Oops, I actually wasn’t trying to discuss whether the action-space was wide enough to take over the world. Turns out concrete examples can be ambiguous too. I was trying to highlight whether the loss function and training method incentivised taking over the world or not.
Instead of an image classifier, let’s take GPT-3, which has a wide enough action-space to take over the world. Let’s assume that:
1. GPT-3 is currently being tested on a validation set which has some correct answers. (I’m fine with “optimal performance” either requiring that GPT-3 magically returns these correct answers, or requiring that it returns some distribution along the lines that I defined in my post.)
2. The researchers will read what GPT-3 outputs, and there exists some string that causes them to go mad and give control over the world to GPT-3.
In this case, if we define optimal performance as “correctly predicting as many words as possible” or “achieve minimum total loss over the entire history of the world”, I agree that optimal performance would plausibly involve taking over the world to feed itself the right questions. However, I think a better definition is “Correctly predicts every word it’s asked to predict”, because I think this tracks what (un)supervised learning actually selects for (absent inner alignment concerns).
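To spell out that contrast in symbols (my notation, not from the comment above): writing $\ell_q(f)$ for the loss of model $f$ on question $q$, the two candidate targets are

$$\text{whole-history: } \min_f \sum_t \ell_{q_t(f)}(f) \qquad \text{vs.} \qquad \text{per-question: } \min_f \ell_q(f) \text{ for every } q \text{ actually posed,}$$

where in the whole-history target the question sequence $q_t(f)$ can depend on $f$’s earlier outputs, so steering towards easy questions is rewarded; in the per-question target it isn’t.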
To make that last point more clear: I’m claiming that in a supervised learning scenario where predicting label X always gives the neural network a small punishment, but then causes the training procedure to give the model 10 easy questions, the model will learn to not choose label X. (At least in the long run. In the short run, I could see the model being a little bit more likely to predict label X than expected, because easy questions could give the model a less steep gradient, so a neural network that predicts X would stick around a bit longer than it would have if X hadn’t yielded easy questions. However, every time X is chosen, the network is directionally pushed away from choosing it again; so in the infinite data limit, I think it would learn to not do that.) Conversely, if predicting label Y is correct on some inputs, but predicting Y immediately leads to 10 hard questions, I don’t think the model would learn to avoid label Y (though I’m unsure if the learning process would converge to choosing Y or just be unstable and never converge). I’m actually very curious if you agree with this; it seems like an important question.
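As a minimal sketch of the kind of learning process I have in mind here (my own toy example, per-example SGD on logistic loss; not something from the comment thread), the update after each question depends only on that question’s loss, so “predicting X buys easier future questions” never enters the gradient, while “X was wrong” always does:

```python
import numpy as np

# Toy per-example SGD on logistic loss (an illustrative sketch).
# The update after question t depends only on (x_t, y_t); which questions the
# training procedure serves up next never enters the gradient. So every time a
# wrong label X is chosen, the weights are pushed away from choosing it again,
# even if choosing X would have led to "easy" follow-up questions.
def per_example_update(w, x_t, y_t, lr=0.1):
    p = 1.0 / (1.0 + np.exp(-w @ x_t))   # predicted probability of label 1
    grad = (p - y_t) * x_t               # gradient of the log loss on this example only
    return w - lr * grad
```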
(Conversely, I think that a robot RL-agent would be incentivised to control its future challenges, because of how RL loss functions are shaped. A definition of RL optimal behavior should naturally take this into account.)
Now, by defining optimal behavior as “Correctly predicts every word it’s asked to predict”, I’m saying that the model is outer aligned at optimum iff the model cannot simultaneously answer every question optimally and adversarially optimise against humans. And if we only consider the normal output channel, I’m saying it couldn’t, because it would have to make incorrect predictions in order to broadcast its message; in which case it wouldn’t be optimal.
If we also consider side-channels, this gets messier, because my chosen definition doesn’t imply any constraints on what the AI could do with side channels. Thus, if it could use side channels to take over the world in any way, my definition would say it wasn’t outer aligned. My preferred solution to this is just to intentionally ignore side channels when defining outer alignment at optimum (as you correctly inferred). Three reasons for this:
As mentioned above, taking side channels into account would mean that any model with powerful side channels is classified as outer misaligned, even if there’s no incentive to use these side channels in any particular way.
Separately, I suspect that supervised learning normally doesn’t incentivise neural networks to use side channels in any particular way (absent inner alignment concerns).
Finally, it just seems kind of useful to talk about the outer alignment properties of abstract agent-models, since not all abstract agent-models are outer aligned. Side-constraints can be handled separately.
(Btw I’d say the bad image classifier is aligned but less performance-competitive than a good image classifier, though I’m sympathetic to the view that it doesn’t make sense to talk about its alignment properties at all.)
Ah, in hindsight your comment makes more sense.
Argh, I don’t know, you’re positing a setup that breaks the standard ML assumptions and so things get weird. If you have vanilla SGD, I think I agree, but I wouldn’t be surprised if that’s totally wrong.
There are definitely setups where I don’t agree, e.g. if you have an outer hyperparameter tuning loop around the SGD, then I think you can get the opposite behavior than what you’re claiming (I think this paper shows this in more detail, though it’s been edited significantly since I read it). That would still depend on how often you do the hyperparameter tuning, what hyperparameters you’re allowed to tune, etc.
----
On the rest of the comment: I feel like the argument you’re making is “when the loss function is myopic, the optimal policy ignores long-term consequences and is therefore safe”. I do feel better about calling this “aligned at optimum”, if the loss function also incentivizes the AI system to do that which we designed the AI system for. It still feels like the lack of convergent instrumental subgoals is “just because of” the myopia, and that this strategy won’t work more generally.
----
Returning to the original claim:
I do agree that these setups probably exist, perhaps using the myopia trick in conjunction with the simulated world trick. (I don’t think myopia by itself is enough; to have STEM AI enable a pivotal act you presumably need to give the AI system a non-trivial amount of “thinking time”.) I think you will still have a pretty rough time trying to define “optimal performance” in a way that doesn’t depend on a lot of details of the setup, but at least conceptually I see what you mean.
I’m not as convinced that these sorts of setups are really feasible—they seem to sacrifice a lot of benefits—but I’m pretty unconfident here.
I think this is pretty complicated, and stretches the meaning of several of the critical terms employed in important ways. I think what you said is reasonable given the limitations of the terminology, but ultimately, may be subtly misleading.
How I would currently put it (which I think strays further from the standard terminology than your analysis):
Take 1
Prediction is not a well-defined optimization problem.
Maximum-a-posteriori reasoning (with a given prior) is a well-defined optimization problem, and we can ask whether it’s outer-aligned. The answer may be “no, because the Solomonoff prior contains malign stuff”.
Variational Bayes (with a given prior and variational loss) is similarly well-defined. We can similarly ask whether it’s outer-aligned.
Minimizing square loss with a regularizing penalty is well-defined. Etc. Etc. Etc.
But “prediction” is not a clearly specified optimization target. Even if you fix the predictive loss (square loss, Bayes loss, etc) you need to specify a prior in order to get a well-defined expectation to minimize.
So the really well-defined question is whether specific predictive optimization targets are outer-aligned at optimum. And this type of outer-alignment seems to require the target to discourage mesa-optimizers!
This is a problem for the existing terminology, since it means these objectives are not outer-aligned unless they are also inner-aligned.
Take 2
OK, but maybe you object. I’m assuming that “optimization” means “optimization of a well-defined function which we can completely evaluate”. But (you might say), we can also optimize under uncertainty. We do this all the time. In your post, you frame “optimal performance” in terms of loss+distribution. Machine learning treats the data as a sample from the true distribution, and uses this as a proxy, but adds regularizers precisely because it’s an imperfect proxy (but the regularizers are still just a proxy).
So, in this frame, we think of the true target function as the average loss on the true distribution (ie the distribution which will be encountered in the wild), and we think of gradient descent (and other optimization methods used inside modern ML) as optimizing a proxy (which is totally normal for optimization under uncertainty).
With this frame, I think the situation gets pretty complicated.
Take 2.1
Sure, ok, if it’s just actually predicting the actual stuff, this seems pretty outer-aligned. Pedantic note: the term “alignment” is weird here. It’s not “perfectly aligned” in the sense of perfectly forwarding human values. But it could be non-malign, which I think is what people mostly mean by “AI alignment” when they’re being careful about meaning.
Take 2.2
But this whole frame is saying that once we have outer alignment, the problem that’s left is the problem of correctly predicting the future. We have to optimize under uncertainty because we can’t predict the future. An outer-aligned loss function can nonetheless yield catastrophic results because of distributional shift. The Solomonoff prior is malign because it doesn’t represent the future with enough accuracy, instead containing some really weird stuff.
So, with this terminology, the inner alignment problem is the prediction problem. If we can predict well enough, then we can set up a proxy which gets us inner alignment (by heavily penalizing malign mesa-optimizers for their future treacherous turns). Otherwise, we’re stuck with the inner alignment problem.
So given this use of terminology, “prediction is outer-aligned” is a pretty weird statement. Technically true, but prediction is the whole inner alignment problem.
Take 2.3
But wait, let’s reconsider 2.1.
In this frame, “optimal performance” means optimal at deployment time. This means we get all the strange incentives that come from online learning. We aren’t actually doing online learning, but optimal performance would respond to those incentives anyway.
(You somewhat circumvent this in your “extending the training distribution” section when you suggest proxies such as the Solomonoff distribution rather than using the actual future to define optimality. But this can reintroduce the same problem and more besides. Option #1, Solomonoff, is probably accurate enough to re-introduce the problems with self-fulfilling prophecies, besides being malign in other ways. Option #3, using a physical quantum prior, requires a solution to quantum gravity, and also is probably accurate enough to re-introduce the same problems with self-fulfilling prophecies as well. The only option I consider feasible is #2, human priors. Because humans could notice this whole problem and refuse to be part of a weird loop of self-fulfilling prediction.)