I’ll make a case here that manipulation of imperfect internal search should be considered the inner alignment problem, and that all the other things which look like inner alignment failures actually stem from outer alignment failure or non-mesa-optimizer-specific generalization failure.
Example: Dr Nefarious
Suppose Dr Nefarious is an agent in the environment who wants to acausally manipulate other agents’ models. We have a model which knows of Dr Nefarious’ existence, and we ask the model to predict what Dr Nefarious will do. At this point, we have already failed: either the model returns a correct answer, in which case Dr Nefarious has acausal control over the answer and can manipulate us through it, or it returns an incorrect answer, in which case the prediction is wrong. (More precisely, the distinction is between informative/independent answers, not correct/incorrect.) The only way to avoid this would be to not ask the question in the first place—but if we need to know what Dr Nefarious will do in order to make good decisions ourselves, then we need to run that query.
On the surface, this looks like an inner alignment failure: there’s a malign subagent in the model. But notice that it’s not even clear what we want in this situation—we don’t know how to write down a goal-specification which avoids the problem while also being useful. The question of “what do we even want to do in this sort of situation?” is unambiguously an outer alignment question. It’s not a situation where we know what we want but we’re not sure how to make a system actually do it; it’s a situation where it’s not even clear what we want.
Conversely, if we did have a good specification of what we want in this situation, then we could just specify that in the outer objective. Once that’s done, we would still potentially need to solve inner alignment problems in practice, but we’d know how to solve them in principle: do the thing which is globally optimal for our outer objective. The whole point of “having a good specification of what we want” is that the globally-optimal thing should be good.
Point of all this: this supposed “inner alignment failure” can be broken into two parts. One of those parts is a “what do we even want?” question, i.e. an outer alignment problem. The other part is a problem of actually achieving the optimal thing, which is where manipulation of imperfect internal search is relevant. If both of those parts are solved, then the system is aligned.
Generalizing The Example
Another example, this time with explicit acausal trade: our AI uses a Solomonoff-like world model, and a subagent in that model is trying to gain influence. Meanwhile, an (unrelated) nefarious agent in the environment wants to manipulate the AI. So, the subagent and the nefarious agent simulate each other and make an acausal deal: the nefarious agent produces a very specific string of bits in the real world, and the subagent gains weight by perfectly predicting that string. In exchange, the subagent manipulates the AI to help the nefarious agent in some way.
Self-fulfilling prophecies provide a similar, but simpler, class of examples.
In each of these, there is a malign inner agent, but that malign inner agent is only able to manipulate the AI successfully because of some structure in the environment. Or, another way to state it: the malign agent is successful only because the combination of (outer) prior + objective does not handle self-fulfilling prophecies or acausal trade the way we (humans) want them to. These are, in an important sense, outer alignment problems: we have not correctly specified what-we-want; even the global optimum of the outer process suffers from the problem.
Objective Is Only Defined With Prior + Data
One possible objection to this is that “outer alignment”—i.e. specifying what-humans-want—should be more narrowly interpreted. In particular, Evan has argued before that generalization errors resulting from e.g. distribution shift between training data and deployment environment should be considered a separate problem.
I disagree with this. I claim that an objective isn’t even well-defined without a distribution; that’s part of the type-signature of an objective.
This is easy to see in the case of an expected utility maximizer. When we write “max E[u(X)]”, X is a variable in the probabilistic model. It is a thing-in-the-model, not a thing-in-the-world; the world does not necessarily share our ontology.
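To make that type-signature explicit, here is a minimal sketch in generic expected-utility notation (my notation, not something quoted from elsewhere):

$$a^* \;=\; \arg\max_{a}\; \mathbb{E}_{X \sim P(\cdot \mid a)}\big[\,u(X)\,\big]$$

Here $u$ is the utility function and $P$ is the probabilistic world-model; $X$ only exists inside $P$'s ontology. Pair the same $u$ with a different model $P'$ and you generally get a different $a^*$, so "maximize $u$" is not a complete specification until the distribution is fixed.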
We could say something similar for any setup which maximizes some expected value on an empirical distribution, i.e. an average over the training data. For instance, maybe we have some labeled images, and we’re training a classifier. We may have an objective for which the system does-what-we-want for the original labels, but does not do what we want if we change the objective function to permute the labels before calculating error (i.e. it switches “true” with “false”). Permuting the labels in the objective function is obviously an outer alignment problem—yet we can achieve exactly the same effect by permuting the labels in the dataset instead.
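As a toy illustration of that equivalence (a sketch constructed here for illustration, not code from anywhere else), permuting the binary labels inside the loss function and permuting them in the dataset produce exactly the same training signal:

```python
import numpy as np

def cross_entropy(pred, label):
    # pred: model's predicted probability that label == 1; label: 0 or 1
    eps = 1e-9
    return -(label * np.log(pred + eps) + (1 - label) * np.log(1 - pred + eps))

def permuted_objective(preds, labels):
    # "bad objective": the loss itself swaps true <-> false before scoring
    return cross_entropy(preds, 1 - labels).mean()

def original_objective(preds, labels):
    return cross_entropy(preds, labels).mean()

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=100)      # toy binary labels
preds = rng.uniform(0.01, 0.99, size=100)  # some model's predictions

# Permute in the objective, or permute in the data: same number, same gradients.
assert np.isclose(permuted_objective(preds, labels),
                  original_objective(preds, 1 - labels))
```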
Another angle: plenty of ML work uses the exact same objective on different data sets, and obviously the resulting systems do completely different things. There is no useful sense in which a training objective can be aligned or misaligned, separate from the context of data/prior.
My point is: there is no line between bad training data and bad objective. These problems only make sense when considered together. So, if “bad training objective” is an outer alignment problem, then we also need to consider “bad training data” to be an outer alignment problem in order for our factorization of the problem to work well. (For Bayesian agents, this also extends to the prior.)
The General Argument
Outer objective, training data, and prior (if any) all have to be treated as a unit: changes in one are equivalent to changes in another, and the objective isn’t even well-defined outside the ontology of the data/prior. The central outer alignment question of “what do we even want?” has to be answered with both an objective and data/prior, in order for the answer to be well-defined.
If we buy that, then outer alignment (i.e. fully answering the question “what do we want?”) implies that the true global optimum of an outer optimizer’s search is aligned. So, there’s only one type of inner alignment problem which would not be solved by solving outer alignment: manipulation of imperfect search. We can have a good objective+prior+data, but the search may still be imperfect, and malign subagents may arise which manipulate that search.
All that said… there’s still an interesting alignment problem which examples like Dr Nefarious or self-fulfilling prophecies or malignness of Solomonoff are pointing to. I claim that inner alignment is not the right way to think about these—it’s not the malign inner agents themselves which are the problem. They’re just an indicator that we have not correctly specified what-we-want.
This is a great comment. I will have to think more about your overall point, but aside from that, you’ve made some really useful distinctions. I’ve been wondering if inner alignment should be defined separately from mesa-optimizer problems, and this seems like more evidence in that direction (i.e., the Dr Nefarious example is a mesa-optimization problem, but it’s about outer alignment). Or maybe inner alignment just shouldn’t be seen as the compliment of outer alignment! Objective quality vs search quality is a nice dividing line, but it doesn’t cluster together the problems people have been trying to cluster together.
Haven’t read the full comment thread, but on this sentence:
Or maybe inner alignment just shouldn’t be seen as the compliment of outer alignment!
Evan actually wrote a post to explain that it isn’t the complement for him (and not the compliment either :p)
Right, but John is disagreeing with Evan’s frame, and John’s argument that such-and-such problems aren’t inner alignment problems is that they are outer alignment problems.
So, I think I could write a much longer response to this (perhaps another post), but I’m more or less not persuaded that problems should be cut up the way you say.
As I mentioned in my other reply, your argument that Dr. Nefarious problems shouldn’t be classified as inner alignment is that they are apparently outer alignment. If inner alignment problems are roughly “the internal objective doesn’t match the external objective” and outer alignment problems are roughly “the outer objective doesn’t meet our needs/goals”, then there’s no reason why these have to be mutually exclusive categories.
In particular, Dr. Nefarious problems can be both.
But more importantly, I don’t entirely buy your notion of “optimization”. This is the part that would require a longer explanation to be a proper reply. But basically, I want to distinguish between “optimization” and “optimization under uncertainty”. Optimization under uncertainty is not optimization—that is, it is not optimization of the type you’re describing, where you have a well-defined objective which you’re simply feeding to a search. Given a prior, you can reduce optimization-under-uncertainty to plain optimization (if you can afford the probabilistic inference necessary to take the expectations, which often isn’t the case). But that doesn’t mean that you do, and anyway, I want to keep them as separate concepts even if one is often implemented by the other.
Your notion of the inner alignment problem applies only to optimization.
Evan’s notion of inner alignment applies (only!) to optimization under uncertainty.
I buy the “problems can be both” argument in principle. However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that’s solved, all that’s left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all. I also think a version of this argument probably carries over even if we’re thinking about optimization-under-uncertainty, although I’m still not sure exactly what that would mean.
In other words: if a problem is both, then it is useful to think of it as an outer alignment problem (because that part has to be solved regardless), and not also inner alignment (because only a narrower version of that part necessarily has to be solved). In the Dr Nefarious example, the outer misalignment causes the inner misalignment in some important sense—correcting the outer problem fixes the inner problem, but patching the inner problem would leave an outer objective which still isn’t what we want.
I’d be interested in a more complete explanation of what optimization-under-uncertainty would mean, other than to take an expectation (or max/min, quantile, etc) to convert it into a deterministic optimization problem.
I’m not sure the optimization vs optimization-under-uncertainty distinction is actually all that central, though. Intuitively, the reason an objective isn’t well-defined without the data/prior is that the data/prior defines the ontology, or defines what the things-in-the-objective are pointing to (in the pointers-to-values sense) or something along those lines. If the objective function is f(X, Y), then the data/prior are what point “X” and “Y” at some things in the real world. That’s why the objective function cannot be meaningfully separated from the data/prior: “f(X, Y)” doesn’t mean anything, by itself.
But I could imagine the pointer-aspect of the data/prior could somehow be separated from the uncertainty-aspect. Obviously this would require a very different paradigm from either today’s ML or Bayesianism, but if those pieces could be separated, then I could imagine a notion of inner alignment (and possibly also something like robust generalization) which talks about both optimization and uncertainty, plus a notion of outer alignment which just talks about the objective and what it points to. In some ways, I actually like that formulation better, although I’m not clear on exactly what it would mean.
Trying to lay this disagreement out plainly:
According to you, the inner alignment problem should apply to well-defined optimization problems, meaning optimization problems which have been given all the pieces needed to score domain items. Within this frame, the only reasonable definition is “inner” = issues of imperfect search, “outer” = issues of objective (which can include the prior, the utility function, etc).
According to me/Evan, the inner alignment problem should apply to optimization under uncertainty, which is a notion of optimization where you don’t have enough information to really score domain items. In this frame, it seems reasonable to point to the way the algorithm tries to fill in the missing information as the location of “inner optimizers”. This “way the algorithm tries to fill in missing info” has to include properties of the search, so we roll search+prior together into “inductive bias”.
I take your argument to have been:
The strength of well-defined optimization as a natural concept;
The weakness of any factorization which separates elements like prior, data, and loss function, because we really need to consider these together in order to see what task is being set for an ML system (Dr Nefarious demonstrates that the task “prediction” becomes the task “create a catastrophe” if prediction is pointed at the wrong data);
The idea that my/Evan’s/Paul’s concern about priors will necessarily be addressed by outer alignment, and so does not need to be solved separately.
Your crux is, can we factor ‘uncertainty’ from ‘value pointer’ such that the notion of ‘value pointer’ contains all (and only) the outer alignment issues? In that case, you could come around to optimization-under-uncertainty as a frame.
I take my argument to have been:
The strength of optimization-under-uncertainty as a natural concept (I argue it is more often applicable than well-defined optimization);
The naturalness of referring to problems involving inner optimizers under one umbrella “inner alignment problem”, whether or not Dr Nefarious is involved;
The idea that the malign-prior problem has to be solved in itself whether we group it as an “inner issue” or an “outer issue”;
For myself in particular, I’m ok with some issues-of-prior, such as Dr Nefarious, ending up as both inner alignment and outer alignment in a classification scheme (not overjoyed, but ok with it).
My crux would be, does a solution to outer alignment (in the intuitive sense) really imply a solution to exorcising mesa-optimizers from a prior (in the sense relevant to eliminating them from perfect search)?
It might also help if I point out that well-defined-optimization vs optimization-under-uncertainty is my current version of the selection/control distinction.
In any case, I’m pretty won over by the uncertainty/pointer distinction. I think it’s similar to the capabilities/payload distinction Jessica has mentioned. This combines search and uncertainty (and any other generically useful optimization strategies) into the capabilities.
But I would clarify that, wrt the ‘capabilities’ element, there seem to be mundane capabilities questions and then inner optimizer questions. IE, we might broadly define “inner alignment” to include all questions about how to point ‘capabilities’ at ‘payload’, but if so, I currently think there’s a special subset of ‘inner alignment’ which is about mesa-optimizers. (Evan uses the term ‘inner alignment’ for mesa-optimizer problems, and ‘objective-robustness’ for broader issues of reliably pursuing goals, but he also uses the term ‘capability robustness’, suggesting he’s not lumping all of the capabilities questions under ‘objective robustness’.)
This is a good summary.
I’m still some combination of confused and unconvinced about optimization-under-uncertainty. Some points:
It feels like “optimization under uncertainty” is not quite the right name for the thing you’re trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.
The examples of optimization-under-uncertainty from your other comment do not really seem to be about uncertainty per se, at least not in the usual sense, whereas the Dr Nefarious example and malignness of the universal prior do.
Your examples in the other comment do feel closely related to your ideas on learning normativity, whereas inner agency problems do not feel particularly related to that (or at least not any more so than anything else is related to normativity).
It does seem like there’s an important sense in which inner agency problems are about uncertainty, in a way which could potentially be factored out, but that seems less true of the examples in your other comment. (Or to the extent that it is true of those examples, it seems true in a different way than the inner agency examples.)
The pointers problem feels more tightly entangled with your optimization-under-uncertainty examples than with inner agency examples.
… so I guess my main gut-feel at this point is that it does seem very plausible that uncertainty-handling (and inner agency with it) could be factored out of goal-specification (including pointers), but this particular idea of optimization-under-uncertainty seems like it’s capturing something different. (Though that’s based on just a handful of examples, so the idea in your head is probably quite different from what I’ve interpolated from those examples.)
On a side note, it feels weird to be the one saying “we can’t separate uncertainty-handling from goals” and you saying “ok but it seems like goals and uncertainty could somehow be factored”. Usually I expect you to be the one saying uncertainty can’t be separated from goals, and me to say the opposite.
Your examples in the other comment do feel closely related to your ideas on learning normativity, whereas inner agency problems do not feel particularly related to that (or at least not any more so than anything else is related to normativity).
Could you elaborate on that? I do think that learning-normativity is more about outer alignment. However, some ideas might cross-apply.
It feels like “optimization under uncertainty” is not quite the right name for the thing you’re trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.
Well, it still seems like a good name to me, so I’m curious what you are thinking here. What name would communicate better?
It does seem like there’s an important sense in which inner agency problems are about uncertainty, in a way which could potentially be factored out, but that seems less true of the examples in your other comment. (Or to the extent that it is true of those examples, it seems true in a different way than the inner agency examples.)
Again, I need more unpacking to be able to say much (or update much).
The pointers problem feels more tightly entangled with your optimization-under-uncertainty examples than with inner agency examples.
Well, optimization-under-uncertainty is an attempt to make a frame which can contain both, so this isn’t necessarily a problem… but I am curious what feels non-tight about inner agency.
… so I guess my main gut-feel at this point is that it does seem very plausible that uncertainty-handling (and inner agency with it) could be factored out of goal-specification (including pointers), but this particular idea of optimization-under-uncertainty seems like it’s capturing something different. (Though that’s based on just a handful of examples, so the idea in your head is probably quite different from what I’ve interpolated from those examples.)
On a side note, it feels weird to be the one saying “we can’t separate uncertainty-handling from goals” and you saying “ok but it seems like goals and uncertainty could somehow be factored”. Usually I expect you to be the one saying uncertainty can’t be separated from goals, and me to say the opposite.
I still agree with the hypothetical me making the opposite point ;p The problem is that certain things are being conflated, so both “uncertainty can’t be separated from goals” and “uncertainty can be separated from goals” have true interpretations. (I have those interpretations clear in my head, but communication is hard.)
OK, so.
My sense of our remaining disagreement…
We agree that the pointers/uncertainty could be factored (at least informally—currently waiting on any formalism).
You think “optimization under uncertainty” is doing something different, and I think it’s doing something close.
Specifically, I think “optimization under uncertainty” importantly is not necessarily best understood as the standard Bayesian thing where we (1) start with a utility function, (2) provide a prior, so that we can evaluate expected value (and 2.5, update on any evidence), (3) provide a search method, so that we solve the whole thing by searching for the highest-expectation element. Many examples of optimization-under-uncertainty strain this model. Probably the pointer/uncertainty model would do a better job in these cases. But, the Bayesian model is kind of the only one we have, so we can use it provisionally. And when we do so, the approximation of pointer-vs-uncertainty that comes out is:
Pointer: The utility function.
Uncertainty: The search plus the prior, which in practice can blend together into “inductive bias”.
This isn’t perfect, by any means, but, I’m like, “this isn’t so bad, right?”
I mean, I think this approximation is very not-good for talking about the pointers problem. But I think it’s not so bad for talking about inner alignment.
I almost want to suggest that we hold off on trying to resolve this, and first, I write a whole post about “optimization under uncertainty” which clarifies the whole idea and argues for its centrality. However, I kind of don’t have time for that atm.
However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that’s solved, all that’s left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all.
The way I’m currently thinking of things, I would say the reverse also applies in this case.
We can turn optimization-under-uncertainty into well-defined optimization by assuming a prior. The outer alignment problem (in your sense) involves getting the prior right. Getting the prior right is part of “figuring out what we want”. But this is precisely the source of the inner alignment problems in the Paul/Evan sense: Paul was pointing out a previously neglected issue about the Solomonoff prior, and Evan is talking about inductive biases of machine learning algorithms (which is sort of like the combination of a prior and imperfect search).
So both you and Evan and Paul are agreeing that there’s this problem with the prior (/ inductive biases). It is distinct from other outer alignment problems (because we can, to a large extent, factor the problem of specifying an expected value calculation into the problem of specifying probabilities and the problem of specifying a value function / utility function / etc). Everyone would seem to agree that this part of the problem needs to be solved. The disagreement is just about whether to classify this part as “inner” and/or “outer”.
What is this problem like? Well, it’s broadly a quality-of-prior problem, but it has a different character from other quality-of-prior problems. For the most part, the quality of priors can be understood by thinking about average error being low, or mistakes becoming infrequent, etc. However, here, this kind of thinking isn’t sufficient: we are concerned with rare but catastrophic errors. Thinking about these things, we find ourselves thinking in terms of “agents inside the prior” (or agents being favored by the inductive biases).
To what extent “agents in the prior” should be lumped together with “agents in imperfect search”, I am not sure. But the term “inner optimizer” seems relevant.
I’d be interested in a more complete explanation of what optimization-under-uncertainty would mean, other than to take an expectation (or max/min, quantile, etc) to convert it into a deterministic optimization problem.
A good example of optimization-under-uncertainty that doesn’t look like that (at least, not overtly) is most applications of gradient descent.
The true objective is not well-defined. IE, machine learning people generally can’t write down an objective function which (a) spells out what they want, and (b) can be evaluated. (What you want is generalization accuracy for the presently-unknown deployment data.)
So, machine learning people create proxies to optimize. Training data is the start, but then you add regularizing terms to penalize complex theories.
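In symbols, that kind of proxy is roughly the usual regularized empirical risk (a generic sketch, not any particular method):

$$\hat{L}(\theta) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_\theta(x_i),\, y_i\big) \;+\; \lambda\, R(\theta)$$

The first term scores the training data, and the regularizer $R(\theta)$ (e.g. a norm penalty) stands in for “penalize complex theories”; neither term is the thing we actually care about, which is performance on the unseen deployment distribution.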
But none of these proxies is the full expected value (ie, expected generalization accuracy). If we could compute the full expected value, we probably wouldn’t be searching for a model at all! We would just use the EV calculations to make the best decision for each individual case.
So you can see, we can always technically turn optimization-under-uncertainty into a well-defined optimization by providing a prior, but, this is usually so impractical that ML people often don’t even consider what their prior might be. Even if you did write down a prior, you’d probably have to do ordinary ML search to approximate that. Which goes to show that it’s pretty hard to eliminate the non-EV versions of optimization-under-uncertainty; if you try to do real EV, you end up using non-EV methods anyway, to approximate EV.
The fact that we’re not really optimizing EV, in typical applications of gradient descent, explains why methods like early stopping or dropout (or anything else that messes with the ability of gradient descent to optimize the given objective) might be useful. Otherwise, you would only expect to use modifications if they helped the search find higher-value items. But in real cases, we sometimes prefer items that have a lower score on our proxy, when the-way-we-got-that-item gives us other reason to expect it to be good (early stopping being the clearest example of this).
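Here is a minimal self-contained sketch of that early-stopping point (toy polynomial regression, all numbers invented for illustration): the model we keep is chosen by held-out score, even though later iterates score better on the proxy we are nominally optimizing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy quadratic, split into train and validation halves.
x = np.linspace(-1, 1, 40)
y = x**2 + 0.1 * rng.normal(size=x.shape)
x_tr, y_tr, x_va, y_va = x[::2], y[::2], x[1::2], y[1::2]

def features(x, degree=9):
    # Over-parameterized polynomial features, so overfitting is possible.
    return np.stack([x**k for k in range(degree + 1)], axis=1)

def mse(w, x, y):
    return np.mean((features(x) @ w - y) ** 2)

w = np.zeros(10)
lr = 0.05
best_val, best_w = np.inf, w.copy()
for step in range(20000):
    grad = 2 * features(x_tr).T @ (features(x_tr) @ w - y_tr) / len(y_tr)
    w -= lr * grad                   # gradient descent on the proxy (training MSE)
    val = mse(w, x_va, y_va)         # held-out estimate of what we actually want
    if val < best_val:
        best_val, best_w = val, w.copy()

# The kept model typically has a *worse* training MSE than the final iterate;
# we prefer it anyway because of how we got it, not because of its proxy score.
print(mse(best_w, x_tr, y_tr), mse(w, x_tr, y_tr))
```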
This in turn means we don’t even necessarily convert our problem to a real, solidly defined optimization problem, ever. We can use algorithms like gradient-descent-with-early-stopping just “because they work well” rather than because they optimize some specific quantity we can already compute.
Which also complicates your argument, since if we’re never converting things to well-defined optimization problems, we can’t factor things into “imperfect search problems” vs “alignment given perfect search”—because we’re not really using search algorithms (in the sense of algorithms designed to get the maximum value), we’re using algorithms with a strong family resemblance to search, but which may have a few overtly-suboptimal kinks thrown in because those kinks tend to reduce Goodharting.
In principle, a solution to an optimization-under-uncertainty problem needn’t look like search at all.
Ah, here’s an example: online convex optimization. It’s a solid example of optimization-under-uncertainty, but, not necessarily thought of in terms of a prior and an expectation.
So optimization-under-uncertainty doesn’t necessarily reduce to optimization.
I claim it’s usually better to think about optimization-under-uncertainty in terms of regret bounds, rather than reduce it to maximization. (EG this is why Vanessa’s approach to decision theory is superior.)
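For concreteness, here is a minimal sketch of online gradient descent, the textbook no-regret algorithm for online convex optimization (the loss sequence below is made up for illustration). Nowhere do we write down a prior or compute an expectation; the guarantee is a regret bound against the best fixed point in hindsight.

```python
import numpy as np

def online_gradient_descent(loss_grad, dim, radius=1.0, rounds=100):
    """Projected online gradient descent; regret vs. the best fixed point is O(sqrt(T))."""
    x = np.zeros(dim)
    plays = []
    for t in range(1, rounds + 1):
        plays.append(x.copy())
        g = loss_grad(t, x)                 # gradient of the loss revealed at round t
        x = x - (radius / np.sqrt(t)) * g   # step size shrinking like 1/sqrt(t)
        norm = np.linalg.norm(x)
        if norm > radius:                   # project back onto the feasible ball
            x = x * (radius / norm)
    return plays

# Example: the environment reveals quadratic losses (x - c_t)^2 with drifting targets.
rng = np.random.default_rng(0)
targets = rng.uniform(-0.5, 0.5, size=100)
plays = online_gradient_descent(lambda t, x: 2 * (x - targets[t - 1]), dim=1)
```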
I’m not sure the optimization vs optimization-under-uncertainty distinction is actually all that central, though. Intuitively, the reason an objective isn’t well-defined without the data/prior is that the data/prior defines the ontology, or defines what the things-in-the-objective are pointing to (in the pointers-to-values sense) or something along those lines. If the objective function is f(X, Y), then the data/prior are what point “X” and “Y” at some things in the real world. That’s why the objective function cannot be meaningfully separated from the data/prior: “f(X, Y)” doesn’t mean anything, by itself.
But I could imagine the pointer-aspect of the data/prior could somehow be separated from the uncertainty-aspect. Obviously this would require a very different paradigm from either today’s ML or Bayesianism, but if those pieces could be separated, then I could imagine a notion of inner alignment (and possibly also something like robust generalization) which talks about both optimization and uncertainty, plus a notion of outer alignment which just talks about the objective and what it points to. In some ways, I actually like that formulation better, although I’m not clear on exactly what it would mean.
These remarks generally make sense to me. Indeed, I think the ‘uncertainty-aspect’ and the ‘search aspect’ would be rolled up into one, since imperfect search falls under the uncertainty aspect (being logical uncertainty). We might not even be able to point to which parts are prior vs search… as with “inductive bias” in ML. So inner alignment problems would always be “the uncertainty is messed up”—forcibly unifying your search-oriented view on daemons w/ Evan’s prior-oriented view. More generally, we could describe the ‘uncertainty’ part as where ‘capabilities’ live.
Naturally, this strikes me as related to what I’m trying to get at with optimization-under-uncertainty. An optimization-under-uncertainty algorithm takes a pointer, and provides all the ‘uncertainty’.
But I don’t think it should quite be about separating the pointer-aspect and the uncertainty-aspect. The uncertainty aspect has what I’ll call “mundane issues” (eg, does it converge well given evidence, does it keep uncertainty broad w/o evidence) and “extraordinary issues” (inner optimizers). Mundane issues can be investigated with existing statistical tools/concepts. But the extraordinary issues seem to require new concepts. The mundane issues have to do with things like averages and limit frequencies. The extraordinary issues have to do with one-time events.
The true heart of the problem is these “extraordinary issues”.
While I agree that outer objective, training data and prior should be considered together, I disagree that it makes the inner alignment problem dissolve except for manipulation of the search. In principle, if you could indeed ensure, through a smart choice of these three parameters, that there is only one global optimum, that all other local minima are “bad” (meaning high loss), and that your search process will always reach the global optimum, then I would agree that the inner alignment problem disappears.
But answering “what do we even want?” at this level of precision seems basically impossible. I expect that it’s pretty much equivalent to specifying exactly the result we want, which we are quite unable to do in general.
So my perspective is that the inner alignment problem appears because of inherent limits on our outer alignment capabilities. And that in realistic settings where we cannot rule out multiple very good local minima, the sort of reasoning underpinning the inner alignment discussion is the best approach we have to address such problems.
That being said, I’m not sure how this view interacts with yours or Evan’s, or if this is a very standard use of the terms. But since that’s part of the discussion Abram is pushing, here is how I use these terms.
Hm, I want to classify “defense against adversaries” as a separate category from both “inner alignment” and “outer alignment”.
The obvious example is: if an adversarial AGI hacks into my AGI and changes its goals, that’s not any kind of alignment problem, it’s a defense-against-adversaries problem.
Then I would take that notion and extend it by saying “yes interacting with an adversary presents an attack surface, but also merely imagining an adversary presents an attack surface too”. Well, at least in weird hypotheticals. I’m not convinced that this would really be a problem in practice, but I dunno, I haven’t thought about it much.
Anyway, I would propose that the procedure for defense against adversaries in general is: (1) shelter an AGI from adversaries early in training, until it’s reasonably intelligent and aligned, and then (2) trust the AGI to defend itself. I’m not sure we can do any better than that.
In particular, I imagine an intelligent and self-aware AGI that’s aligned in trying to help me would deliberately avoid imagining an adversarial superintelligence that can acausally hijack its goals!
That still leaves the issue of early training, when the AGI is not yet motivated to not imagine adversaries, or not yet able. So I would say: if it does imagine the adversary, and then its goals do get hijacked, then at that point I would say “OK yes now it’s misaligned”. (Just like if a real adversary is exploiting a normal security hole—I would say the AGI is aligned before the adversary exploits that hole, and misaligned after.) Then what? Well, presumably, we will need to have a procedure that verifies alignment before we release the AGI from its training box. And that procedure would presumably be indifferent to how the AGI came to be misaligned. So I don’t think that’s really a special problem we need to think about.
That still leaves the issue of early training, when the AGI is not yet motivated to not imagine adversaries, or not yet able. So I would say: if it does imagine the adversary, and then its goals do get hijacked, then at that point I would say “OK yes now it’s misaligned”. (Just like if a real adversary is exploiting a normal security hole—I would say the AGI is aligned before the adversary exploits that hole, and misaligned after.) Then what? Well, presumably, we will need to have a procedure that verifies alignment before we release the AGI from its training box. And that procedure would presumably be indifferent to how the AGI came to be misaligned. So I don’t think that’s really a special problem we need to think about.
This part doesn’t necessarily make sense, because prevention could be easier than after-the-fact measures. In particular,
You might be unable to defend against arbitrarily adversarial cognition, so, you might want to prevent it early rather than try to detect it later, because you may be vulnerable in between.
You might be able to detect some sorts of misalignment, but not others. In particular, it might be very difficult to detect purposeful deception, since it intelligently evades whatever measures are in place. So your misalignment-detection may be dependent on averting mesa-optimizers or specific sorts of mesa-optimizers.
That’s fair. Other possible approaches are “try to ensure that imagining dangerous adversarial intelligences is aversive to the AGI-in-training ASAP, such that this motivation is installed before the AGI is able to do so”, or “interpretability that looks for the AGI imagining dangerous adversarial intelligences”.
I guess the fact that people don’t tend to get hijacked by imagined adversaries gives me some hope that the first one is feasible—like, that maybe there’s a big window where one is smart enough to understand that imagining adversarial intelligences can be bad, but not smart enough to do so with such fidelity that it actually is dangerous.
But hard to say what’s gonna work, if anything, at least at my current stage of general ignorance about the overall training process.
I think one major reason why people don’t tend to get hijacked by imagined adversaries is that you can’t simulate someone who is smarter than you, and therefore you can defend against anything you can simulate in your mind.
This is not a perfect argument, since I can imagine someone that has power over me in the real world, and for example imagine how angry they would be at me if I did something they did not like. But then their power over me comes from their power in the real world, not their ability to outsmart me inside my own mind.
Not to disagree hugely, but I have heard one religious conversion (an enlightenment type experience) described in a way that fits with “takeover without holding power over someone”. Specifically this person described enlightenment in terms close to “I was ready to pack my things and leave. But the poison was already in me. My self died soon after that.”
It’s possible to get the general flow of the arguments another person would make, spontaneously produce those arguments later, and be convinced by them (or at least influenced).
I’ll make a case here that manipulation of imperfect internal search should be considered the inner alignment problem, and all the other things which look like inner alignment failures actually stem from outer alignment failure or non-mesa-optimizer-specific generalization failure.
Example: Dr Nefarious
Suppose Dr Nefarious is an agent in the environment who wants to acausally manipulate other agents’ models. We have a model which knows of Dr Nefarious’ existence, and we ask the model to predict what Dr Nefarious will do. At this point, we have already failed: either the model returns a correct answer, in which case Dr Nefarious has acausal control over the answer and can manipulate us through it, or it returns an incorrect answer, in which case the prediction is wrong. (More precisely, the distinction is between informative/independent answers, not correct/incorrect.) The only way to avoid this would be to not ask the question in the first place—but if we need to know what Dr Nefarious will do in order to make good decisions ourselves, then we need to run that query.
On the surface, this looks like an inner alignment failure: there’s a malign subagent in the model. But notice that it’s not even clear what we want in this situation—we don’t know how to write down a goal-specification which avoids the problem while also being useful. The question of “what do we even want to do in this sort of situation?” is unambiguously an outer alignment question. It’s not a situation where we know what we want but we’re not sure how to make a system actually do it; it’s a situation where it’s not even clear what we want.
Conversely, if we did have a good specification of what we want in this situation, then we could just specify that in the outer objective. Once that’s done, we would still potentially need to solve inner alignment problems in practice, but we’d know how to solve them in principle: do the thing which is globally optimal for our outer objective. The whole point of “having a good specification of what we want” is that the globally-optimal thing should be good.
Point of all this: this supposed “inner alignment failure” can be broken into two parts. One of those parts is a “what do we even want?” question, i.e. an outer alignment problem. The other part is a problem of actually achieving the optimal thing, which is where manipulation of imperfect internal search is relevant. If both of those parts are solved, then the system is aligned.
Generalizing The Example
Another example, this time with explicit acausal trade: our AI uses a Solomonoff-like world model, and a subagent in that model is trying to gain influence. Meanwhile, an (unrelated) nefarious agent in the environment wants to manipulate the AI. So, the subagent and the nefarious agent simulate each other and make an acausal deal: the nefarious agent produces a very specific string of bits in the real world, and the subagent gains weight by perfectly predicting that string. In exchange, the subagent manipulates the AI to help the nefarious agent in some way.
Self-fulfilling prophecies provide a similar, but simpler, class of examples.
In each of these, there is a malign inner agent, but that malign inner agent is only able to manipulate the AI successfully because of some structure in the environment. Or, another way to state it: the malign agent is successful only because the combination of (outer) prior + objective does not handle self-fulfilling prophecies or acausal trade the way we (humans) want them to. These are, in an important sense, outer alignment problems: we have not correctly specified what-we-want; even the global optimum of the outer process suffers from the problem.
Objective Is Only Defined With Prior + Data
One possible objection to this is that “outer alignment”—i.e. specifying what-humans-want—should be more narrowly interpreted. In particular, Evan has argued before that generalization errors resulting from e.g. distribution shift between training data and deployment environment should be considered a separate problem.
I disagree with this. I claim that an objective isn’t even well-defined without a distribution; that’s part of the type-signature of an objective.
This is easy to see in the case of an expected utility maximizer. When we write “max E[u(X)]”, X is a variable in the probabilistic model. It is a thing-in-the-model, not a thing-in-the-world; the world does not necessarily share our ontology.
We could say something similar for any setup which maximizes some expected value on an empirical distribution, i.e. an average over the training data. For instance, maybe we have some labeled images, and we’re training a classifier. We may have an objective for which the system does-what-we-want for the original labels, but does not do what we want if we change the objective function to permute the labels before calculating error (i.e. it switches “true” with “false”). Permuting the labels in the objective function is obviously an outer alignment problem—yet we can achieve exactly the same effect by permuting the labels in the dataset instead.
Another angle: plenty of ML work uses the exact same objective on different data sets, and obviously they do completely different things. There is no useful sense in which a training objective can be aligned or misaligned, separate from the context of data/prior.
My point is: there is no line between bad training data and bad objective. These problems only make sense when considered together. So, if “bad training objective” is an outer alignment problem, then we also need to consider “bad training data” to be an outer alignment problem in order for our factorization of the problem to work well. (For Bayesian agents, this also extends to the prior.)
The General Argument
Outer objective, training data, and prior (if any) all have to be treated as a unit: changes in one are equivalent to changes in another, and the objective isn’t even well-defined outside the ontology of the data/prior. The central outer alignment question of “what do we even want?” has to be answered with both an objective and data/prior, in order for the answer to be well-defined.
If we buy that, then outer alignment (i.e. fully answering the question “what do we want?”) implies that the true global optimum of an outer optimizer’s search is aligned. So, there’s only one type of inner alignment problem which would not be solved by solving outer alignment: manipulation of imperfect search. We can have a good objective+prior+data, but the search may still be imperfect, and malign subagents may arise which manipulate that search.
All that said… there’s still an interesting alignment problem which examples like Dr Nefarious or self-fulfilling prophecies or maligness of Solomonoff are pointing to. I claim that inner alignment is not the right way to think about these—it’s not the malign inner agents themselves which are the problem. They’re just an indicator that we have not correctly specified what-we-want.
This is a great comment. I will have to think more about your overall point, but aside from that, you’ve made some really useful distinctions. I’ve been wondering if inner alignment should be defined separately from mesa-optimizer problems, and this seems like more evidence in that direction (ie, the dr nefarious example is a mesa-optimization problem, but it’s about outer alignment). Or maybe inner alignment just shouldn’t be seen as the compliment of outer alignment! Objective quality vs search quality is a nice dividing line, but, doesn’t cluster together the problems people have been trying to cluster together.
Haven’t read the full comment thread, but on this sentence
Evan actually wrote a post to explain that it isn’t the complement for him (and not the compliment either :p)
Right, but John is disagreeing with Evan’s frame, and John’s argument that such-and-such problems aren’t inner alignment problems is that they are outer alignment problems.
So, I think I could write a much longer response to this (perhaps another post), but I’m more or less not persuaded that problems should be cut up the way you say.
As I mentioned in my other reply, your argument that Dr. Nefarious problems shouldn’t be classified as inner alignment is that they are apparently outer alignment. If inner alignment problems are roughly “the internal objective doesn’t match the external objective” and outer alignment problems are roughly “the outer objective doesn’t meet our needs/goals”, then there’s no reason why these have to be mutually exclusive categories.
In particular, Dr. Nefarious problems can be both.
But more importantly, I don’t entirely buy your notion of “optimization”. This is the part that would require a longer explanation to be a proper reply. But basically, I want to distinguish between “optimization” and “optimization under uncertainty”. Optimization under uncertainty is not optimization—that is, it is not optimization of the type you’re describing, where you have a well-defined objective which you’re simply feeding to a search. Given a prior, you can reduce optimization-under-uncertainty to plain optimization (if you can afford the probabilistic inference necessary to take the expectations, which often isn’t the case). But that doesn’t mean that you do, and anyway, I want to keep them as separate concepts even if one is often implemented by the other.
Your notion of the inner alignment problem applies only to optimization.
Evan’s notion of inner alignment applies (only!) to optimization under uncertainty.
I buy the “problems can be both” argument in principle. However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that’s solved, all that’s left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all. I also think a version of this argument probably carries over even if we’re thinking about optimization-under-uncertainty, although I’m still not sure exactly what that would mean.
In other words: if a problem is both, then it is useful to think of it as an outer alignment problem (because that part has to be solved regardless), and not also inner alignment (because only a narrower version of that part necessarily has to be solved). In the Dr Nefarious example, the outer misalignment causes the inner misalignment in some important sense—correcting the outer problem fixes the inner problem , but patching the inner problem would leave an outer objective which still isn’t what we want.
I’d be interested in a more complete explanation of what optimization-under-uncertainty would mean, other than to take an expectation (or max/min, quantile, etc) to convert it into a deterministic optimization problem.
I’m not sure the optimization vs optimization-under-uncertainty distinction is actually all that central, though. Intuitively, the reason an objective isn’t well-defined without the data/prior is that the data/prior defines the ontology, or defines what the things-in-the-objective are pointing to (in the pointers-to-values sense) or something along those lines. If the objective function is f(X, Y), then the data/prior are what point “X” and “Y” at some things in the real world. That’s why the objective function cannot be meaningfully separated from the data/prior: “f(X, Y)” doesn’t mean anything, by itself.
But I could imagine the pointer-aspect of the data/prior could somehow be separated from the uncertainty-aspect. Obviously this would require a very different paradigm from either today’s ML or Bayesianism, but if those pieces could be separated, then I could imagine a notion of inner alignment (and possibly also something like robust generalization) which talks about both optimization and uncertainty, plus a notion of outer alignment which just talks about the objective and what it points to. In some ways, I actually like that formulation better, although I’m not clear on exactly what it would mean.
Trying to lay this disagreement out plainly:
According to you, the inner alignment problem should apply to well-defined optimization problems, meaning optimization problems which have been given all the pieces needed to score domain items. Within this frame, the only reasonable definition is “inner” = issues of imperfect search, “outer” = issues of objective (which can include the prior, the utility function, etc).
According to me/Evan, the inner alignment problem should apply to optimization under uncertainty, which is a notion of optimization where you don’t have enough information to really score domain items. In this frame, it seems reasonable to point to the way the algorithm tries to fill in the missing information as the location of “inner optimizers”. This “way the algorithm tries to fill in missing info” has to include properties of the search, so we roll search+prior together into “inductive bias”.
I take your argument to have been:
The strength of well-defined optimization as a natural concept;
The weakness of any factorization which separates elements like prior, data, and loss function, because we really need to consider these together in order to see what task is being set for an ML system (Dr Nefarious demonstrates that the task “prediction” becomes the task “create a catastrophe” if prediction is pointed at the wrong data);
The idea that the my/Evan/Paul’s concern about priors will necessarily be addressed by outer alignment, so does not need to be solved separately.
Your crux is, can we factor ‘uncertainty’ from ‘value pointer’ such that the notion of ‘value pointer’ contains all (and only) the outer alignment issues? In that case, you could come around to optimization-under-uncertainty as a frame.
I take my argument to have been:
The strength of optimization-under-uncertainty as a natural concept (I argue it is more often applicable than well-defined optimization);
The naturalness of referring to problems involving inner optimizers under one umbrella “inner alignment problem”, whether or not Dr Nefarious is involved;
The idea that the malign-prior problem has to be solved in itself whether we group it as an “inner issue” or an “outer issue”;
For myself in particular, I’m ok with some issues-of-prior, such as Dr Nefarious, ending up as both inner alignment and outer alignment in a classification scheme (not overjoyed, but ok with it).
My crux would be, does a solution to outer alignment (in the intuitive sense) really imply a solution to exorcising mesa-optimizers from a prior (in the sense relevant to eliminating them from perfect search)?
It might also help if I point out that well-defined-optimization vs optimization-under-uncertainty is my current version of the selection/control distinction.
In any case, I’m pretty won over by the uncertainty/pointer distinction. I think it’s similar to the capabilities/payload distinction Jessica has mentioned. This combines search and uncertainty (and any other generically useful optimization strategies) into the capabilities.
But I would clarify that, wrt the ‘capabilities’ element, there seem to be mundane capabilities questions and then inner optimizer questions. IE, we might broadly define “inner alignment” to include all questions about how to point ‘capabilities’ at ‘payload’, but if so, I currently think there’s a special subset of ‘inner alignment’ which is about mesa-optimizers. (Evan uses the term ‘inner alignment’ for mesa-optimizer problems, and ‘objective-robustness’ for broader issues of reliably pursuing goals, but he also uses the term ‘capability robustness’, suggesting he’s not lumping all of the capabilities questions under ‘objective robustness’.)
This is a good summary.
I’m still some combination of confused and unconvinced about optimization-under-uncertainty. Some points:
It feels like “optimization under uncertainty” is not quite the right name for the thing you’re trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.
The examples of optimization-under-uncertainty from your other comment do not really seem to be about uncertainty per se, at least not in the usual sense, whereas the Dr Nefarious example and maligness of the universal prior do.
Your examples in the other comment do feel closely related to your ideas on learning normativity, whereas inner agency problems do not feel particularly related to that (or at least not any more so than anything else is related to normativity).
It does seem like there’s in important sense in which inner agency problems are about uncertainty, in a way which could potentially be factored out, but that seems less true of the examples in your other comment. (Or to the extent that it is true of those examples, it seems true in a different way than the inner agency examples.)
The pointers problem feels more tightly entangled with your optimization-under-uncertainty examples than with inner agency examples.
… so I guess my main gut-feel at this point is that it does seem very plausible that uncertainty-handling (and inner agency with it) could be factored out of goal-specification (including pointers), but this particular idea of optimization-under-uncertainty seems like it’s capturing something different. (Though that’s based on just a handful of examples, so the idea in your head is probably quite different from what I’ve interpolated from those examples.)
On a side note, it feels weird to be the one saying “we can’t separate uncertainty-handling from goals” and you saying “ok but it seems like goals and uncertainty could somehow be factored”. Usually I expect you to be the one saying uncertainty can’t be separated from goals, and me to say the opposite.
Could you elaborate on that? I do think that learning-normativity is more about outer alignment. However, some ideas might cross-apply.
Well, it still seems like a good name to me, so I’m curious what you are thinking here. What name would communicate better?
Again, I need more unpacking to be able to say much (or update much).
Well, the optimization-under-uncertainty is an attempt to make a frame which can contain both, so this isn’t necessarily a problem… but I am curious what feels non-tight about inner agency.
I still agree with the hypothetical me making the opposite point ;p The problem is that certain things are being conflated, so both “uncertainty can’t be separated from goals” and “uncertainty can be separated from goals” have true interpretations. (I have those interpretations clear in my head, but communication is hard.)
OK, so.
My sense of our remaining disagreement…
We agree that the pointers/uncertainty could be factored (at least informally—currently waiting on any formalism).
You think “optimization under uncertainty” is doing something different, and I think it’s doing something close.
Specifically, I think “optimization under uncertainty” importantly is not necessarily best understood as the standard Bayesian thing where we (1) start with a utility function, (2) provide a prior, so that we can evaluate expected value (and 2.5, update on any evidence), (3) provide a search method, so that we solve the whole thing by searching for the highest-expectation element. Many examples of optimization-under-uncertainty strain this model. Probably the pointer/uncertainty model would do a better job in these cases. But, the Bayesian model is kind of the only one we have, so we can use it provisionally. And when we do so, the approximation of pointer-vs-uncertainty that comes out is:
Pointer: The utility function.
Uncertainty: The search plus the prior, which in practice can blend together into “inductive bias”.
This isn’t perfect, by any means, but, I’m like, “this isn’t so bad, right?”
I mean, I think this approximation is very not-good for talking about the pointers problem. But I think it’s not so bad for talking about inner alignment.
I almost want to suggest that we hold off on trying to resolve this, and first, I write a whole post about “optimization under uncertainty” which clarifies the whole idea and argues for its centrality. However, I kind of don’t have time for that atm.
The way I’m currently thinking of things, I would say the reverse also applies in this case.
We can turn optimization-under-uncertainty into well-defined optimization by assuming a prior. The outer alignment problem (in your sense) involves getting the prior right. Getting the prior right is part of “figuring out what we want”. But this is precisely the source of the inner alignment problems in the Paul/Evan sense: Paul was pointing out a previously neglected issue about the Solomonoff prior, and Evan is talking about inductive biases of machine learning algorithms (which is sort of like the combination of a prior and imperfect search).
So both you and Evan and Paul are agreeing that there’s this problem with the prior (/ inductive biases). It is distinct from other outer alignment problems (because we can, to a large extent, factor the problem of specifying an expected value calculation into the problem of specifying probabilities and the problem of specifying a value function / utility function / etc). Everyone would seem to agree that this part of the problem needs to be solved. The disagreement is just about whether to classify this part as “inner” and/or “outer”.
What is this problem like? Well, it’s broadly a quality-of-prior problem, but it has a different character from other quality-of-prior problems. For the most part, the quality of priors can be understood by thinking about average error being low, or mistakes becoming infrequent, etc. However, here, this kind of thinking isn’t sufficient: we are concerned with rare but catastrophic errors. Thinking about these things, we find ourselves thinking in terms of “agents inside the prior” (or agents being favored by the inductive biases).
To what extent “agents in the prior” should be lumped together with “agents in imperfect search”, I am not sure. But the term “inner optimizer” seems relevant.
A good example of optimization-under-uncertainty that doesn’t look like that (at least, not overtly) is most applications of gradient descent.
The true objective is not well-defined. IE, machine learning people generally can’t write down an objective function which (a) spells out what they want, and (b) can be evaluated. (What you want is generalization accuracy for the presently-unknown deployment data.)
So, machine learning people create proxies to optimize. Training data is the start, but then you add regularizing terms to penalize complex theories.
But none of these proxies is the full expected value (ie, expected generalization accuracy). If we could compute the full expected value, we probably wouldn’t be searching for a model at all! We would just use the EV calculations to make the best decision for each individual case.
So you can see, we can always technically turn optimization-under-uncertainty into a well-defined optimization by providing a prior, but this is usually so impractical that ML people often don’t even consider what their prior might be. Even if you did write down a prior, you’d probably have to do ordinary ML search to approximate that. Which goes to show that it’s pretty hard to eliminate the non-EV versions of optimization-under-uncertainty; if you try to do real EV, you end up using non-EV methods anyway, to approximate EV.
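For concreteness, here is a minimal sketch of the kind of proxy in question (the names and the particular regularizer are illustrative choices, not anything specific being endorsed above):

```python
import numpy as np

# What we would like to optimize (not computable): expected accuracy on the
# presently-unknown deployment data. What we optimize instead: a proxy built
# from the training set, plus a regularizer penalizing complex models.

def proxy_objective(weights, X_train, y_train, reg_strength=0.1):
    preds = X_train @ weights
    train_loss = np.mean((preds - y_train) ** 2)              # fit to the data we have
    complexity_penalty = reg_strength * np.sum(weights ** 2)  # "penalize complex theories"
    return train_loss + complexity_penalty

# The "true" objective would be something like
#   true_objective(weights) = E[accuracy(weights)] over deployment data,
# but the deployment distribution is exactly the thing we don't have.
```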
The fact that we’re not really optimizing EV, in typical applications of gradient descent, explains why methods like early stopping or dropout (or anything else that messes with the ability of gradient descent to optimize the given objective) might be useful. Otherwise, you would only expect to use modifications if they helped the search find higher-value items. But in real cases, we sometimes prefer items that have a lower score on our proxy, when the-way-we-got-that-item gives us other reason to expect it to be good (early stopping being the clearest example of this).
This in turn means we don’t even necessarily convert our problem to a real, solidly defined optimization problem, ever. We can use algorithms like gradient-descent-with-early-stopping just “because they work well” rather than because they optimize some specific quantity we can already compute.
Which also complicates your argument, since if we’re never converting things to well-defined optimization problems, we can’t factor things into “imperfect search problems” vs “alignment given perfect search”—because we’re not really using search algorithms (in the sense of algorithms designed to get the maximum value), we’re using algorithms with a strong family resemblance to search, but which may have a few overtly-suboptimal kinks thrown in because those kinks tend to reduce Goodharting.
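To illustrate “preferring an item with a lower proxy score because of how we got it”, here is a toy gradient-descent-with-early-stopping sketch (the data, step size, and patience are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy linear relation, split into a training and a validation set.
X = rng.normal(size=(60, 5))
true_w = rng.normal(size=5)
y = X @ true_w + rng.normal(scale=0.5, size=60)
X_tr, y_tr, X_va, y_va = X[:40], y[:40], X[40:], y[40:]

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

w = np.zeros(5)
best_w, best_va = w.copy(), mse(w, X_va, y_va)
patience, bad_steps = 20, 0

for step in range(2000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= 0.05 * grad                       # descend on the proxy (training loss)
    va = mse(w, X_va, y_va)
    if va < best_va:
        best_w, best_va, bad_steps = w.copy(), va, 0
    else:
        bad_steps += 1
        if bad_steps >= patience:          # stop even though the proxy could keep improving
            break

# We keep best_w, which may score worse on the proxy (training loss) than the
# final w; we prefer it because of how we got it, not because of its score.
print(mse(best_w, X_tr, y_tr), mse(w, X_tr, y_tr))
```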
In principle, a solution to an optimization-under-uncertainty problem needn’t look like search at all.
Ah, here’s an example: online convex optimization. It’s a solid example of optimization-under-uncertainty, but, not necessarily thought of in terms of a prior and an expectation.
So optimization-under-uncertainty doesn’t necessarily reduce to optimization.
I claim it’s usually better to think about optimization-under-uncertainty in terms of regret bounds, rather than reduce it to maximization. (EG this is why Vanessa’s approach to decision theory is superior.)
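For concreteness, here is a minimal online-gradient-descent sketch in the regret framing (the losses and step size are arbitrary illustrative choices); note that no prior and no expected value show up anywhere:

```python
import numpy as np

# Online gradient descent on a stream of convex losses, judged by regret
# against the best fixed decision in hindsight.

rng = np.random.default_rng(1)
T, dim = 200, 3

def loss(u, z):
    return np.sum((u - z) ** 2)

x = np.zeros(dim)
zs, our_total_loss = [], 0.0

for t in range(1, T + 1):
    z_t = rng.normal(size=dim)          # this round's loss is revealed after we commit to x
    our_total_loss += loss(x, z_t)
    zs.append(z_t)
    x = x - (x - z_t) / np.sqrt(t)      # gradient step with a diminishing step size

# Best fixed decision in hindsight (for squared loss, the mean of the z_t's).
u_star = np.mean(zs, axis=0)
hindsight_loss = sum(loss(u_star, z) for z in zs)
regret = our_total_loss - hindsight_loss
print(regret / T)   # average regret; the standard guarantee is that this shrinks as T grows
```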
These remarks generally make sense to me. Indeed, I think the ‘uncertainty-aspect’ and the ‘search aspect’ would be rolled up into one, since imperfect search falls under the uncertainty aspect (being logical uncertainty). We might not even be able to point to which parts are prior vs search… as with “inductive bias” in ML. So inner alignment problems would always be “the uncertainty is messed up”—forcibly unifying your search-oriented view on daemons w/ Evan’s prior-oriented view. More generally, we could describe the ‘uncertainty’ part as where ‘capabilities’ live.
Naturally, this strikes me as related to what I’m trying to get at with optimization-under-uncertainty. An optimization-under-uncertainty algorithm takes a pointer, and provides all the ‘uncertainty’.
But I don’t think it should quite be about separating the pointer-aspect and the uncertainty-aspect. The uncertainty aspect has what I’ll call “mundane issues” (eg, does it converge well given evidence, does it keep uncertainty broad w/o evidence) and “extraordinary issues” (inner optimizers). Mundane issues can be investigated with existing statistical tools/concepts. But the extraordinary issues seem to require new concepts. The mundane issues have to do with things like averages and limit frequencies. The extraordinary issues have to do with one-time events.
The true heart of the problem is these “extraordinary issues”.
While I agree that the outer objective, training data, and prior should be considered together, I disagree that it makes the inner alignment problem dissolve except for manipulation of the search. In principle, if you could indeed ensure, through a smart choice of these three parameters, that there is only one global optimum, that the only other local minima are “bad” (meaning high-loss) ones, and that your search process will always reach the global optimum, then I would agree that the inner alignment problem disappears.
But answering “what do we even want?” at this level of precision seems basically impossible. I expect that it’s pretty much equivalent to specifying exactly the result we want, which we are quite unable to do in general.
So my perspective is that the inner alignment problem appears because of inherent limits on our outer alignment capabilities. And that in realistic settings where we cannot rule out multiple very good local minima, the sort of reasoning underpinning the inner alignment discussion is the best approach we have to address such problems.
That being said, I’m not sure how this view interacts with yours or Evan’s, or if this is a very standard use of the terms. But since that’s part of the discussion Abram is pushing, here is how I use these terms.
Hm, I want to classify “defense against adversaries” as a separate category from both “inner alignment” and “outer alignment”.
The obvious example is: if an adversarial AGI hacks into my AGI and changes its goals, that’s not any kind of alignment problem, it’s a defense-against-adversaries problem.
Then I would take that notion and extend it by saying “yes interacting with an adversary presents an attack surface, but also merely imagining an adversary presents an attack surface too”. Well, at least in weird hypotheticals. I’m not convinced that this would really be a problem in practice, but I dunno, I haven’t thought about it much.
Anyway, I would propose that the procedure for defense against adversaries in general is: (1) shelter an AGI from adversaries early in training, until it’s reasonably intelligent and aligned, and then (2) trust the AGI to defend itself. I’m not sure we can do any better than that.
In particular, I imagine an intelligent and self-aware AGI that’s aligned in trying to help me would deliberately avoid imagining an adversarial superintelligence that can acausally hijack its goals!
That still leaves the issue of early training, when the AGI is not yet motivated to not imagine adversaries, or not yet able to avoid doing so. So I would say: if it does imagine the adversary, and then its goals do get hijacked, then at that point I would say “OK yes now it’s misaligned”. (Just like if a real adversary is exploiting a normal security hole—I would say the AGI is aligned before the adversary exploits that hole, and misaligned after.) Then what? Well, presumably, we will need to have a procedure that verifies alignment before we release the AGI from its training box. And that procedure would presumably be indifferent to how the AGI came to be misaligned. So I don’t think that’s really a special problem we need to think about.
This part doesn’t necessarily make sense, because prevention could be easier than after-the-fact measures. In particular,
You might be unable to defend against arbitrarily adversarial cognition, so, you might want to prevent it early rather than try to detect it later, because you may be vulnerable in between.
You might be able to detect some sorts of misalignment, but not others. In particular, it might be very difficult to detect purposeful deception, since it intelligently evades whatever measures are in place. So your misalignment-detection may be dependent on averting mesa-optimizers or specific sorts of mesa-optimizers.
That’s fair. Other possible approaches are “try to ensure that imagining dangerous adversarial intelligences is aversive to the AGI-in-training ASAP, such that this motivation is installed before the AGI is able to do so”, or “interpretability that looks for the AGI imagining dangerous adversarial intelligences”.
I guess the fact that people don’t tend to get hijacked by imagined adversaries gives me some hope that the first one is feasible—like, that maybe there’s a big window where one is smart enough to understand that imagining adversarial intelligences can be bad, but not smart enough to do so with such fidelity that it actually is dangerous.
But hard to say what’s gonna work, if anything, at least at my current stage of general ignorance about the overall training process.
I think one major reason why people don’t tend to get hijacked by imagined adversaries is that you can’t simulate someone who is smarter than you, and therefore you can defend against anything you can simulate in your mind.
This is not a perfect argument, since I can imagine someone who has power over me in the real world and, for example, imagine how angry they would be at me if I did something they did not like. But then their power over me comes from their power in the real world, not their ability to outsmart me inside my own mind.
Not to disagree hugely, but I have heard one religious conversion (an enlightenment type experience) described in a way that fits with “takeover without holding power over someone”. Specifically this person described enlightenment in terms close to “I was ready to pack my things and leave. But the poison was already in me. My self died soon after that.”
It’s possible to get the general flow of the arguments another person would make, spontaneously produce those arguments later, and be convinced by them (or at least influenced).