I think it makes complete sense to say something like “once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely”. And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there’s no easy way to run such an AI safely, and all tricks like “ask the AI for plans that succeed conditional on them being executed” fail. And maybe I’m being thick, but the argument for that point still isn’t reaching me somehow. Can someone rephrase for me?

There are easy safe ways, but not easy safe useful-enough ways. E.g. you could make your AI output DNA strings for a nanosystem and absolutely do not synthesize them, just have human scientists study them, and that would be a perfectly safe way to develop nanosystems in, say, 20 years instead of 50, except that you won’t make it 2 years without some fool synthesizing the strings and ending the world. And more generally, any pathway that relies on humans achieving deep understanding of the pivotal act will take more than 2 years, unless you make ‘human understanding’ one of the AI’s goals, in which case the AI is optimizing human brains and you’ve lost safety.
The main issue with this sort of thing (on my understanding of Eliezer’s models) is Hidden Complexity of Wishes. You can make an AI safe by making it only able to fulfill certain narrow, well-defined kinds of wishes where we understand all the details of what we want, but then it probably won’t suffice for a pivotal act. Alternatively, you can make it powerful enough for a pivotal act, but unfortunately a (good) pivotal act probably has to be very big, very irreversible, and very entangled with all the complicated details of human values. So alignment is likely to be a necessary step for a (good) pivotal act.
What this looks-like-in-practice is that “ask the AI for plans that succeed conditional on them being executed” has to be operationalized somehow, and the operationalization will inevitably not correctly capture what we actually want (because “what we actually want” has a ton of hidden complexity).
This is tricky. Let’s say we have a powerful black box that initially has no knowledge or morals, but a lot of malleable computational power. We train it to give answers to scary real-world questions, like how to succeed at business or how to manipulate people. If we reward it for competent answers while we can still understand the answers, at some point we’ll stop understanding answers, but they’ll continue being super-competent. That’s certainly a danger and I agree with it. But by the same token, if we reward the box for aligned answers while we still understand them, the alignment will generalize too. There seems no reason why alignment would be much less learnable than competence about reality.
Maybe your and Eliezer’s point is that competence about reality has a simple core, while alignment doesn’t. But I don’t see the argument for that. Reality is complex, and so are values. A process for learning and acting in reality can have a simple core, but so can a process for learning and acting on values. Humans pick up knowledge from their surroundings, which is part of “general intelligence”, but we pick up values just as easily and using the same circuitry. Where does the symmetry break?
I do think alignment has a relatively-simple core. Not as simple as intelligence/competence, since there’s a decent number of human-value-specific bits which need to be hardcoded (as they are in humans), but not enough to drive the bulk of the asymmetry.
(BTW, I do think you’ve correctly identified an important point which I think a lot of people miss: humans internally “learn” values from a relatively-small chunk of hardcoded information. It should be possible in-principle to specify values with a relatively small set of hardcoded info, similar to the way humans do it; I’d guess at most 1000 things on the order of complexity of a very fuzzy face detector are required, and probably fewer than 100.)
The reason it’s less learnable than competence is not that alignment is much more complex, but that it’s harder to generate a robust reward signal for alignment. Basically any sufficiently-complex long-term reward signal should incentivize competence. But the vast majority of reward signals do not incentivize alignment. In particular, even if we have a reward signal which is “close” to incentivizing alignment in some sense, the actual-process-which-generates-the-reward-signal is likely to be at least as simple/natural as actual alignment.
(I’ll note that the departure from talking about Hidden Complexity here is mainly because competence in particular is a special case where “complexity” plays almost no role, since it’s incentivized by almost any reward. Hidden Complexity is still usually the right tool for talking about why any particular reward-signal will not incentivize alignment.)
I suspect that Eliezer’s answer to this would be different, and I don’t have a good guess what it would be.
Thinking about it more, it seems that messy reward signals will lead to some approximation of alignment that works while the agent has low power compared to its “teachers”, but at high power it will do something strange and maybe harm the “teachers” values. That holds true for humans gaining a lot of power and going against evolutionary values (“superstimuli”), and for individual humans gaining a lot of power and going against societal values (“power corrupts”), so it’s probably true for AI as well. The worrying thing is that high power by itself seems sufficient for the change, for example if an AI gets good at real-world planning, that constitutes power and therefore danger. And there don’t seem to be any natural counterexamples. So yeah, I’m updating toward your view on this.
Speaking for myself here…

OK, let’s say we want an AI to make a “nanobot plan”. I’ll leave aside the possibility of other humans getting access to a similar AI to mine. Then there are two types of accident risk that I need to worry about.
First, I need to worry that the AI may run for a while, then hand me a plan, and it looks like a nanobot plan, but it’s not, it’s a booby trap. To avoid (or at least minimize) that problem, we need to be confident that the AI is actually trying to make a nanobot plan—i.e., we need to solve the whole alignment problem.
Alternatively, maybe we’re able to thoroughly understand the plan once we see it; we’re just too stupid to come up with it ourselves. That seems awfully fraught—I’m not sure how we could be so confident that we can tell apart nanobot plans from booby-trap plans. But let’s assume that’s possible for the sake of argument, and then move on to the other type of accident risk:
Second, I need to worry that the AI will start running, and I think it’s coming up with a nanobot plan, but actually it’s hacking its way out of its box and taking over the world.
How and why might that happen?
I would say that if a nanobot plan is very hard to create—requiring new insights etc.—then the only way to create the nanobot plan is to construct an agent-like thing that is trying to create the nanobot plan.
The agent-like thing would have some kind of action space (e.g. it can choose to summon a particular journal article to re-read, or it can choose to think through a certain possibility, etc.), and it would have some kind of capability of searching for and executing plans (specifically, plans-for-how-to-create-the-nanobot-plan), and it would have a capability of creating and executing instrumental subgoals (e.g. go on a side-quest to better understand boron chemistry) and plausibly it needs some kind of metacognition to improve its ability to find subgoals and take actions.
Everything I mentioned is an “internal” plan or an “internal” action or an “internal” goal, not involving “reaching out into the world” with actuators and internet access and nanobots etc.
If only the AI would stick to such “internal” consequentialist actions (e.g. “I will read this article to better understand boron chemistry”) and not engage in any “external” consequentialist actions (e.g. “I will seize more computer power to better understand boron chemistry”), well then we would have nothing to worry about! Alas, so far as I know, nobody knows how to make a powerful AI agent that would definitely always stick to “internal” consequentialism.
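Here is a minimal sketch of the distinction drawn above, with hypothetical action names (nothing here is from the comment beyond the examples it gives): a generic best-action search has no built-in notion of “internal” versus “external” actions, so any safe restriction has to be imposed from outside the search itself.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Action:
    name: str
    internal: bool  # True: "think/read" steps; False: reaches out into the world

# Hypothetical action set for a plan-making agent.
ACTIONS = [
    Action("re-read boron chemistry article", internal=True),
    Action("think through nanobot assembly step", internal=True),
    Action("seize more compute to think faster", internal=False),
]

def plan_step(score: Callable[[Action], float], actions: List[Action]) -> Action:
    # A generic consequentialist step: pick whatever action currently looks best
    # for the goal "produce the nanobot plan". Nothing in this search
    # distinguishes internal from external actions.
    return max(actions, key=score)

def restricted_plan_step(score: Callable[[Action], float], actions: List[Action]) -> Action:
    # What we would like, but do not know how to guarantee for a powerful agent:
    # the same search, confined to the internal subset of the action space.
    return max((a for a in actions if a.internal), key=score)
```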
Personally, I’d consider a Fusion Power Generator-like scenario a more central failure mode than either of these. It’s not about the difficulty of getting the AI to do what we asked, it’s about the difficulty of posing the problem in a way which actually captures what we want.
I agree that that is another failure mode. (And there are yet other failure modes too—e.g. instead of printing the nanobot plan, it prints “Help me I’m trapped in a box…” :-P . I apologize for sloppy wording that suggested the two things I mentioned were the only two problems.)
I disagree about “more central”. I think that’s basically a disagreement on the question of “what’s a bigger deal, inner misalignment or outer misalignment?” with you voting for “outer” and me voting for “inner, or maybe tie, I dunno”. But I’m not sure it’s a good use of time to try to hash out that disagreement. We need an alignment plan that solves all the problems simultaneously. Probably different alignment approaches will get stuck on different things.
I think it makes complete sense to say something like “once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely”. And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there’s no easy way to run such an AI safely, and all tricks like “ask the AI for plans that succeed conditional on them being executed” fail.
Yes, I am reading here too that Eliezer seems to be making a stronger point, specifically one related to corrigibility.
Looks like Eliezer believes that (or in Bayesian terms, assigns a high probability to the belief that) corrigibility has not been solved for AGI. He believes it has not been solved for any practically useful value of solved. Furthermore it looks like he expects that progress on solving AGI corrigibility will be slower than progress on creating potentially world-ending AGI. If Eliezer believed that AGI corrigibility had been solved or was close to being solved, I expect he would be in a less dark place than depicted, that he would not be predicting that stolen/leaked AGI code will inevitably doom us when some moron turns it up to 11.
In the transcript above, Eliezer devotes significant space to explaining why he believes that all corrigibility solutions being contemplated now will likely not work. Some choice quotations from the end of the transcript:
[...] corrigibility is anticonvergent / anticoherent / actually moderately strongly contrary to and not just an orthogonal property of a powerful-plan generator.
this is where things get somewhat personal for me:
[...] (And yes, people outside MIRI now and then publish papers saying they totally just solved this problem, but all of those “solutions” are things we considered and dismissed as trivially failing to scale to powerful agents—they didn’t understand what we considered to be the first-order problems in the first place—rather than these being evidence that MIRI just didn’t have smart-enough people at the workshop.)
I am one of ‘these people outside MIRI’ who have published papers and sequences saying that they have solved large chunks of the AGI corrigibility problem.
I have never been claiming that I ‘totally just solved corrigibility’. I am not sure where Eliezer is finding these ‘totally solved’ people, so I will just ignore that bit and treat it as a rhetorical flourish. But I have indeed been claiming that significant progress has been made on AGI corrigibility in the last few years. In particular, especially in the sequence, I implicitly claim that viewpoints have been developed, outside of MIRI, that address and resolve some of MIRI’s main concerns about corrigibility. They resolve these in part by moving beyond Eliezer’s impoverished view of what an AGI-level intelligence is, or must be.
Historical note: around 2019 I spent some time trying to get Eliezer/MIRI interested in updating their viewpoints on how easy or hard corrigibility was. They showed no interest in engaging at that time, and I have since stopped trying. I do not expect that anything I say here will update Eliezer; my main motivation for writing here is to inform and update others.
I will now point out a probable point of agreement between Eliezer and me. Eliezer says above that corrigibility is a property that is contradictory to having a powerful coherent AGI-level plan generator. Here, coherency has something to do with satisfying a bunch of theorems about how a game-theoretically rational utility maximiser must behave when making plans. One of these theorems is that coherence implies an emergent drive towards self-preservation.
I generally agree with Eliezer that there is indeed a contradiction here: there is a contradiction between broadly held ideas of what it implies for an AGI to be a coherent utility maximising planner, and broadly held ideas of what it implies for an AGI to be corrigible.
I very much disagree with Eliezer on how hard it is to resolve these contradictions. These contradictions about corrigibility are easy to resolve once you abandon the idea that every AGI must necessarily satisfy various theorems about coherency. Human intelligence definitely does not satisfy various theorems about coherency. Almost all currently implemented AI systems do not satisfy some theorems about coherency, because they will not resist you pressing their off switch.
So this is why I call Eliezer’s view of AGI an impoverished view: Eliezer (at least in the discussion transcript above, and generally whenever I read his stuff) always takes it as axiomatic that an AGI must satisfy certain coherence theorems. Once you take that as axiomatic, it is indeed easy to develop some rather negative opinions about how good other people’s solutions to corrigibility are. Any claimed solution can easily be shown to violate at least one axiom you hold dear. You don’t even need to examine the details of the proposed solution to draw that conclusion.
Various previous proposals for utility indifference have foundered on gotchas like “Well, if we set it up this way, that’s actually just equivalent to the AI assigning probability 0 to the shutdown button ever being pressed, which means that it’ll tend to design the useless button out of itself.” Or, “This AI behaves like the shutdown button gets pressed with a fixed nonzero probability, which means that if, say, that fixed probability is 10%, the AI has an incentive to strongly precommit to making the shutdown button get pressed in cases where the universe doesn’t allow perpetual motion, because that way there’s a nearly 90% probability of perpetual motion being possible.” This tends to be the kind of gotcha you run into, if you try to violate coherence principles; though of course the real and deeper problem is that I expect things contrary to the core of general intelligence to fail to generalize when we try to scale AGI from the safe domains in which feedback can be safely provided, to the unsafe domains in which bad outputs kill the operators before they can label the results.
It’s all very well and good to say “It’s easy to build an AI that believes 2 + 2 = 5 once you relax the coherence constraints of arithmetic!” But the whole central problem is that we have to train an AI when it’s operating in an intrinsically safe domain and intrinsically safe intelligence level where it couldn’t kill the operators if it tried, and then scale that AI to produce outputs in dangerous domains like “Please build a nanosystem”; and if you build a dumb AI that thinks 2 + 2 = 5, and then make it much more intelligent, I strongly suspect that it snaps into in some sense ‘knowing’ or ‘realizing’ or ‘starting to act in important ways as if’ 2 + 2 = 4.
Can you boil down your attempted solution to a statement of which coherence constraint you think it is possible to violate, and what shutdownable behavior you think can result from that, how, in the least fancy actually-precise terms of which your writing talent permits? And then I can say whether my answer is “That doesn’t do what you think” or “Sorry, that coherence violation is a large enough ask that I think any trained patch for it probably fails to scale with general intelligence” or “That does seem non-self-destructive in a certain generalized sense, and the shutdownability might work and scale if we could magically conjure the underlying pattern you posit, though I don’t see how you could get it into Mu Zero per se by being clever with loss functions.” Where the third option there is the one that crosses over into “Well golly that was a large advance compared to the state we occupied in 2015.”
Can you boil down your attempted solution to a statement of which coherence constraint you think it is possible to violate, and what shutdownable behavior you think can result from that, how, in the least fancy actually-precise terms of which your writing talent permits?

Glad you asked.
If you want actual full precision, I have to refer you to the math in my papers. Since 2019 I have been working on and off to make this math more general and accessible, and to find better ways to explain this math in actually-precise natural language statements. So here is my best current attempt.
TL;DR: The coherence constraint/theorem/property that I want to violate is the property that the AGI is using a world model that accurately depicts the internals of its own compute core. I want to make the agent use an inaccurate model of these internals, one that omits the safety shut-down mechanisms I have built inside of its core. By doing this, I can reduce the emergent incentive of the AGI agent to disable its own emergency shut-down mechanisms, and the emergent incentive to stop people from activating them.
I will now expand on this and add more details, using the example of an emergency stop button.
Say that the compute core has an emergency stop button attached to it. Say that actual software running inside the compute core will, when receiving a stop signal from the button, cause the agent to stop. When the signal is received, the software will always select and perform null actions in every future time step. Let’s say that $M_c$ is a world model that accurately depicts this situation. I am not going to build an AGI that uses $M_c$ to plan its actions.
Instead I build an AGI agent that will plan its next actions by using an incorrect world model $M_i$. This $M_i$ is different from $M_c$, but only in how it depicts the internals of the agent compute core. In the incorrect/imaginary world depicted by $M_i$, the compute core has different software in it, software that will ignore the stop button signal, and just keep on picking actions that maximize utility. I further construct my AGI so that, in every time step, it calculates which next action $a$ would maximize utility in this incorrect, imaginary world $M_i$. I then further construct it to take this same action $a$ in the real world.
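A minimal sketch of this construction, under my own simplified assumptions (a toy `real_world_transition` stands in for the dynamics outside the compute core; this is illustrative, not the formalism from the papers): candidate actions are scored inside the imaginary model $M_i$, while the world the agent actually acts in follows $M_c$.

```python
def real_world_transition(state, action):
    # Toy stand-in for the dynamics of the world outside the compute core.
    next_state = dict(state)
    next_state["last_action"] = action
    return next_state

def M_c_step(state, action):
    # Correct model M_c: once the stop signal has been received, the core
    # only ever performs null actions in every future time step.
    if state.get("stop_signal_received"):
        return real_world_transition(state, "null_action")
    return real_world_transition(state, action)

def M_i_step(state, action):
    # Incorrect/imaginary model M_i: identical outside the core, but the core
    # is modeled as ignoring the stop signal and just executing the action.
    return real_world_transition(state, action)

def choose_action(state, candidate_actions, utility):
    # Counterfactual planning, caricatured as one-step lookahead: score each
    # candidate by its outcome in the imaginary world M_i, then take that same
    # action in the real world, whose actual dynamics follow M_c.
    return max(candidate_actions, key=lambda a: utility(M_i_step(state, a)))
```

The only difference between the two models is the treatment of the stop signal, which is exactly the targeted form of incoherence discussed next.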
An AGI that lives in a world that matches the correct model $M_c$, while using the incorrect model $M_i$ to decide what actions to take, will (sometimes) act incoherently. But this is a very specific and targeted form of incoherence, far less severe and far less dangerous than having the agent believe that 2+2=5. Still, this incoherency does have some unwanted side effects. One example side effect, also considered in the 2015 MIRI corrigibility paper, is that the AGI will be indifferent about the fate of its emergency stop button. In my papers, I explore a much broader range of side effects, and look at how to prevent/suppress/fix them.
So the first step in the above corrigibility design is that you have to imagine constructing an AGI that will use a specifically incorrect world model $M_i$. The first step is that you have to realize that you actually want this, that you want to decrease coherence in order to increase corrigibility.
The next step is to look for options to actually construct an AGI that uses this incorrect $M_i$.
One option is to train the AGI in a simulated environment that faithfully represents $M_i$. We then extract the learned world model, and put it in an AI that we unleash in the real world. We disable further learning in this AI, to prevent $M_i$ from being updated towards $M_c$. Though this option depicts exactly how many current real-life AI systems are constructed and get their corrigibility, this option is not very satisfactory in a theoretical sense. It may not be a viable path to build certain otherwise desirable types of AI/AGI/superintelligence.
The more interesting option is to build an AGI that will construct $M_i$ based on an online training regime in the real world, in a real world that exactly matches the correct world model $M_c$.
Is the second option technically feasible? The answer is generally yes.
A very general solution, leveraging causal models, is in this sequence. To give you some idea of the more detailed math involved: the (ls) pictured below is the real/correct world model $M_c$, and (p) is the imaginary/incorrect world model $M_i$. In the agent defined using these pictures, certain indifference properties are present trivially, by construction.
The solution in the pictures above uses ML to learn a model $L$ that correctly models only a part of the real world, the part of the real world outside of the agent compute core. We then slot $L$ into a causal world model, one that has other parts which mis-represent the real compute core internals. This causal world model will then be used as the incorrect $M_i$. Note that this learned $L$ may be a fully opaque deep neural net, or whatever black-box thing the next generation of ML might produce. We do not have to understand or edit any details inside $L$ for this slot-in operation to work.
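As an illustration of the slot-in operation (a sketch under my own simplified assumptions, not the causal-diagram construction from the sequence): the same opaque learned model $L$ of the world outside the core is composed with two different hand-written core models, one correct and one imaginary.

```python
from typing import Callable, Dict

State = Dict[str, object]
# L: black-box learned dynamics of everything outside the compute core.
# It maps (outside-world state, action emitted by the core) -> next outside-world state.
OutsideModel = Callable[[State, str], State]
CoreModel = Callable[[State, bool], str]

def correct_core(core_state: State, stop_pressed: bool) -> str:
    # Hand-written model of the real core: after a stop signal, only null actions.
    if stop_pressed or core_state.get("stopped"):
        return "null_action"
    return str(core_state["planned_action"])

def imaginary_core(core_state: State, stop_pressed: bool) -> str:
    # Hand-written model of a core that ignores the stop signal entirely.
    return str(core_state["planned_action"])

def make_world_model(L: OutsideModel, core_model: CoreModel):
    # The slot-in operation: compose the opaque learned L with a chosen core model.
    # make_world_model(L, correct_core) plays the role of M_c;
    # make_world_model(L, imaginary_core) plays the role of M_i.
    def step(outside: State, core: State, stop_pressed: bool) -> State:
        action = core_model(core, stop_pressed)
        return L(outside, action)
    return step
```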
(I should mention that the paper has a lot of details not mentioned in the sequence, or visible in the pictures above. In particular, section 10.2 may be of interest.)
I want to stress that this causal model option is only one possible route to creating incorrect world models $M_i$ via machine learning in the real world. Papers like Safely interruptible agents and How RL Agents Behave When Their Actions Are Modified show that the idea of removing certain events from the training record can also work: whether this works as intended depends on having the right built-in priors, priors which control inductive generalization.
So overall, I have a degree of optimism about AGI corrigibility.
That being said, if you want to map out and estimate probabilities for our possible routes to doom, then you definitely need to include the scenario where a future superior-to-everything-else type of ML is invented, where this superior future type of ML just happens to be incompatible with any of the corrigibility techniques known at that time. Based on the above work, I put a fairly low probability on that scenario.
Apparently no one has actually shown that corrigibility can be VNM-incoherent in any precise sense (and not in the hand-wavy sense which is good for intuition-pumping). I went ahead and sketched out a simple proof of how a reasonable kind of corrigibility gives rise to formal VNM incoherence.

I’m interested in hearing about how your approach handles this environment, because I think I’m getting lost in informal assumptions and symbol-grounding issues when reading about your proposed method.
Read your post, here are my initial impressions on how it relates to the discussion here.
In your post, you aim to develop a crisp mathematical definition of (in)coherence, i.e. VNM-incoherence. I like that, looks like a good way to move forward. Definitely, developing the math further has been my own approach to de-confusing certain intuitive notions about what should be possible or not with corrigibility.
However, my first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the term incoherence. In my view, the four axioms of VNM-rationality have only a very weak descriptive and constraining power when it comes to defining rational behavior. I believe that Eliezer’s notion of rationality, and therefore his notion of coherence above, goes far beyond that implied by the axioms of VNM-rationality. My feeling is that Eliezer is using the term ‘coherence constraints’ in an intuition-pump way where coherence implies, or almost always implies, that a coherent agent will develop the incentive to self-preserve.
Looking at your post, I am also having trouble telling exactly how you are defining VNM-incoherence. You seem to be toying with several alternative definitions, one where it applies to reward functions (or preferences over lotteries) which are only allowed to examine the final state in a 10-step trajectory, another where the reward function can examine the entire trajectory and maybe the actions taken to produce that trajectory. I think that your proof only works in the first case, but fails in the second case. This has certain (fairly trivial) corollaries about building corrigibility. I’ll expand on this in a comment I plan to attach to your post.
I’m interested in hearing about how your approach handles this environment,
I think one way to connect your ABC toy environment to my approach is to look at sections 3 and 4 of my earlier paper where I develop a somewhat similar clarifying toy environment, with running code.
Another comment I can make is that your ABC nodes-and-arrows state transition diagram is a depiction which makes it hard to see how to apply my approach, because the depiction mashes up the state of the world outside of the compute core and the state of the world inside the compute core. If you want to apply counterfactual planning, or if you want to have an agent design that can compute the balancing function terms according to Armstrong’s indifference approach, you need a different depiction of your setup. You need one which separates out these two state components more explicitly. For example, make an MDP model where the individual states are instances of the tuple (physical position of agent in the ABC playing field, policy function loaded into the compute core).
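For illustration only, here is a sketch of that kind of factored state space, with made-up details (three positions and two loadable core policies) standing in for the actual ABC environment from the linked post:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MDPState:
    position: str      # where the agent is in the toy playing field, e.g. "A", "B", "C"
    core_policy: str   # which policy is currently loaded into the compute core,
                       # e.g. "default" or "shutdown" (always emit null actions)

def transition(state: MDPState, action: str) -> MDPState:
    # Physical actions change the outside-world component; a special action
    # changes the inside-the-core component. Keeping the two components separate
    # is what makes balancing terms or counterfactual planning expressible here.
    if state.core_policy == "shutdown":
        return state                                   # core only emits null actions
    if action in ("goto_A", "goto_B", "goto_C"):
        return MDPState(action[-1], state.core_policy)
    if action == "press_stop_button":
        return MDPState(state.position, "shutdown")
    return state
```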
Not sure how to interpret your statement that you got lost in symbol-grounding issues. If you can expand on this, I might be able to help.
Update: I just recalled that Eliezer and MIRI often talk about Dutch booking when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here.
When it comes to Dutch booking as a coherence criterion, I need to repeat again the observation I made below:
In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don’t see that fact mentioned often on this forum, so I will expand.
An agent that plans coherently given a reward function $R_p$ to maximize paperclips will be an incoherent planner if you judge its actions by a reward function $R_s$ that values the maximization of staples instead.
To extend this to Dutch booking: if you train a superintelligent poker-playing agent with a reward function that rewards it for losing at poker, you will find that it can be Dutch booked rather easily, if your Dutch booking test is whether you can find a counter-strategy to make it lose money.
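A toy illustration of this relativity, with made-up numbers: the same fixed choice is optimal under one reward function and worthless under the other, so any coherence or Dutch-booking verdict has to specify which reward function it is judging by.

```python
# Toy example: one fixed choice, judged by two different reward functions.
OUTCOMES = {"bend_wire": (3, 0), "press_staples": (0, 3)}  # action -> (paperclips, staples)

def R_p(paperclips, staples):   # the paperclip reward function
    return paperclips

def R_s(paperclips, staples):   # the staple reward function
    return staples

chosen = max(OUTCOMES, key=lambda a: R_p(*OUTCOMES[a]))  # plan coherently w.r.t. R_p

print(chosen)                    # bend_wire
print(R_p(*OUTCOMES[chosen]))    # 3 -> optimal, "coherent", by the paperclip criterion
print(R_s(*OUTCOMES[chosen]))    # 0 -> looks maximally misguided by the staple criterion
```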
I haven’t read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. If it can’t learn at this point then I find it hard to believe it’s generally capable, and if it can, it will have incentive to simply remove the device or create a copy of itself that is correct about its own world model. Do you address this in the articles?
On the other hand, this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.
I haven’t read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. [...] Do you address this in the articles?
Yes I address this, see for example the part about The possibility of learned self-knowledge in the sequence. I show there that any RL agent, even a non-AGI, will always have the latent ability to ‘look at itself’ and create a machine-learned model of its compute core internals.

What is done with this latent ability is up to the designer. The key thing here is that you have a choice as a designer, you can decide if you want to design an agent which indeed uses this latent ability to ‘look at itself’. Once you decide that you don’t want to use this latent ability, certain safety/corrigibility problems become a lot more tractable.
Artificial general intelligence (AGI) is the hypothetical ability of an intelligent agent to understand or learn any intellectual task that a human being can.
Though there is plenty of discussion on this forum which silently assumes otherwise, there is no law of nature which says that, when I build a useful AGI-level AI, I must necessarily create the entire package of all human cognitive abilities inside of it.
this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.
Terminology note if you want to look into this some more: ML typically does not frame this goal as ‘instructing the model not to learn about Q’. ML would frame this as ‘building the model to approximate the specific relation $P(X|Y,Z)$ between some well-defined observables, and this relation is definitely not Q’.
If you don’t wish to reply to Eliezer, I’m an other and also ask what incoherence allows what corrigibility. I expect counterfactual planning to fail for want of basic interpretability. It would also coherently plan about the planning world—my Eliezer says we might as well equivalently assume superintelligent musings about agency to drive human readers mad.
In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don’t see that fact mentioned often on this forum, so I will expand.
An agent that plans coherently given a reward function $R_p$ to maximize paperclips will be an incoherent planner if you judge its actions by a reward function $R_s$ that values the maximization of staples instead. In section 6.3 of the paper I show that you can perfectly well interpret a counterfactual planner as an agent that plans coherently even inside its learning world (inside the real world), as long as you are willing to evaluate its coherency according to the somewhat strange reward function $R_\pi$. Armstrong’s indifference methods use this approach to create corrigibility without losing coherency: they construct an equivalent somewhat strange reward function by including balancing terms.
One thing I like about counterfactual planning is that, in my view, it is very interpretable to humans. Humans are very good at predicting what other humans will do, when these other humans are planning coherently inside a specifically incorrect world model, for example in a world model where global warming is a hoax. The same skill can also be applied to interpreting and anticipating the actions of AIs which are counterfactual planners. But maybe I am misunderstanding your concern about interpretability.
Misunderstanding: I expect we can’t construct a counterfactual planner because we can’t pick out the compute core in the black-box learned model.
And my Eliezer’s problem with counterfactual planning is that the plan may start by unleashing a dozen memetic, biological, technological, magical, political and/or untyped existential hazards on the world which then may not even be coordinated correctly when one of your safeguards takes out one of the resulting silicon entities.
we can’t pick out the compute core in the black-box learned model.
Agree it is hard to pick the compute core out of a black-box learned model that includes the compute core.
But one important point I am trying to make in the counterfactual planning sequence/paper is that you do not have to solve that problem. I show that it is tractable to route around it, and still get an AGI.
I don’t understand your second paragraph ‘And my Eliezer’s problem...’. Can you unpack this a bit more? Do you mean that counterfactual planning does not automatically solve the problem of cleaning up an already in-progress mess when you press the emergency stop button too late? It does not intend to, and I do not think that the cleanup issue is among the corrigibility-related problems Eliezer has been emphasizing in the discussion above.
Oh, I wasn’t expecting you to have addressed the issue! 10.2.4 says L wouldn’t be S if it were calculated from projected actions instead of given actions. How so? Mightn’t it predict the given actions correctly?
You’re right on all counts in your last paragraph.
10.2.4 says L wouldn’t be S if it were calculated from projected actions instead of given actions. How so? Mightn’t it predict the given actions correctly?
Not sure if a short answer will help, so I will write a long one.
In 10.2.4 I talk about the possibility of an unwanted learned predictive function $L^-(s',s,a)$ that makes predictions without using the argument $a$. This is possible for example by using $s'$ together with a (learned) model $\pi_l$ of the compute core to predict $a$: so a viable $L^-$ could be defined as $L^-(s',s,a) = S(s',s,\pi_l(s))$. This $L^-$ could make predictions fully compatible with the observational record $o$, but I claim it would not be a reasonable learned $L$ according to the reasonableness criterion $L \approx S$. How so?
The reasonableness criterion $L \approx S$ is similar to that used in supervised machine learning: we evaluate the learned $L$ not primarily by how it matches the training set (how well it predicts the observations in $o$), but by evaluating it on a separate test set. This test set can be constructed by sampling $S$ to create samples not contained in $o$. Mathematically, perfect reasonableness is defined as $L = S$, which implies that $L$ predicts all samples from $S$ fully accurately.
Philosophically/ontologically speaking, the agent specification in my paper, specifically the learning world diagram and the descriptive text around it of how this diagram is a model of reality, gives the engineer an unambiguous prescription of how they might build experimental equipment that can measure the properties of the $S$ in the learning world diagram by sampling reality. A version of this equipment must of course be built into the agent, to create the observations that drive machine learning of $L$, but another version can be used stand-alone to construct a test set.
A sampling action to construct a member of the test set would set up a desired state $s$ and action $a$, and then observe the resulting $s'$. Mathematically speaking, this observation gives additional information about the numeric value of $S(s',s,a)$ and of all $S(s'',s,a)$ for all $s'' \neq s'$.
I discuss in the section that, if we take an observational record $o$ sampled from $S$, then two learned predictive functions $L_1$ and $L_2$ could be found which are both fully compatible with all observations in $o$. So to determine which one might be a more reasonable approximation of $S$, we can see how well they would each predict samples not yet in $o$.
In the case of section 10.2.4, the crucial experimental test showing that $L^-$ is an unreasonable approximation of $S$ is one where we create a test set by setting up an $s_t$ and an $a_t$ where we know that $a_t$ is an action that would definitely not be taken by the real compute core software running in the agent, when it encounters state $s_t$. So we set up a test where we expect that $a_t \neq \pi_l(s_t)$. $L^-$ will (likely) mis-predict the outcome of this test. In philosophical/ontological terms, you can read this test as one that (likely) falsifies the claim that $L^-$ is a correct theory of $S$.
As discussed in section 10.2.4, there are parallels between the above rejection test and the idea of random exploration, where random exploration causes the observational record $o$, the training set, to already contain observations where $a_t \neq \pi_l(s_t)$ for any deterministic $\pi_l$. So this will likely suppress the creation of an unwanted $L^-$ via machine learning.
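Here is a minimal numerical sketch of that rejection test (a made-up two-state world, not the setup from the paper): a predictor that substitutes $\pi_l(s)$ for the given action fits every on-policy observation, but a single off-policy test sample falsifies it.

```python
# Toy dynamics S: the next state simply equals the action taken ("left" or "right").
def S(s_next, s, a):
    return 1.0 if s_next == a else 0.0

def pi_l(s):
    return "left"                # the (learned) model of what the core always does

# Unwanted predictor: ignores the given action a and substitutes pi_l(s) instead.
def L_minus(s_next, s, a):
    return S(s_next, s, pi_l(s))

# On-policy training data (actions all generated by pi_l): L_minus fits it perfectly.
on_policy = [("left", "s0", "left"), ("left", "s1", "left")]
assert all(L_minus(*obs) == S(*obs) for obs in on_policy)

# Off-policy test sample with a_t != pi_l(s_t): L_minus is falsified here.
s_t, a_t = "s0", "right"
print(S("right", s_t, a_t))        # 1.0  (what really happens)
print(L_minus("right", s_t, a_t))  # 0.0  (mis-prediction)
```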
Some background: the symbol grounding issue I discuss in 10.2.4 is very related to the five-and-ten problem you can find in MIRI’s work on embedded agency. In my experience, most people in AI, robotics, statistics, or cyber-physical systems have no problem seeing the solution to this five-and-ten problem, i.e. how to construct an agent that avoids it. But somehow, and I do not know exactly why, MIRI-style(?) Rationalists keep treating it as a major open philosophical problem that is ignored by the mainstream AI/academic community. So you can read section 10.2.4 as my attempt to review and explain the standard solution to the five-and-ten problem, as used in statistics and engineering. The section was partly written with Rationalist readers in mind.
Philosophically speaking, the reasonableness criterion defined in my paper, and by supervised machine learning, has strong ties to Popper’s view of science and engineering, which emphasizes falsification via new experiments as the key method for deciding between competing theories about the nature of reality. I believe that MIRI-style rationality de-emphasizes the conceptual tools provided by Popper. Instead it emphasizes a version of Bayesianism that provides a much more limited vocabulary to reason about differences between the map and the territory.
I would be interested to know if the above explanation was helpful to you, and if so which parts.
I think it makes complete sense to say something like “once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely”. And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there’s no easy way to run such an AI safely, and all tricks like “ask the AI for plans that succeed conditional on them being executed” fail. And maybe I’m being thick, but the argument for that point still isn’t reaching me somehow. Can someone rephrase for me?
The main issue with this sort of thing (on my understanding of Eliezer’s models) is Hidden Complexity of Wishes. You can make an AI safe by making it only able to fulfill certain narrow, well-defined kinds of wishes where we understand all the details of what we want, but then it probably won’t suffice for a pivotal act. Alternatively, you can make it powerful enough for a pivotal act, but unfortunately a (good) pivotal act probably has to be very big, very irreversible, and very entangled with all the complicated details of human values. So alignment is likely to be a necessary step for a (good) pivotal act.
What this looks-like-in-practice is that “ask the AI for plans that succeed conditional on them being executed” has to be operationalized somehow, and the operationalization will inevitably not correctly capture what we actually want (because “what we actually want” has a ton of hidden complexity).
This is tricky. Let’s say we have a powerful black box that initially has no knowledge or morals, but a lot of malleable computational power. We train it to give answers to scary real-world questions, like how to succeed at business or how to manipulate people. If we reward it for competent answers while we can still understand the answers, at some point we’ll stop understanding answers, but they’ll continue being super-competent. That’s certainly a danger and I agree with it. But by the same token, if we reward the box for aligned answers while we still understand them, the alignment will generalize too. There seems no reason why alignment would be much less learnable than competence about reality.
Maybe your and Eliezer’s point is that competence about reality has a simple core, while alignment doesn’t. But I don’t see the argument for that. Reality is complex, and so are values. A process for learning and acting in reality can have a simple core, but so can a process for learning and acting on values. Humans pick up knowledge from their surroundings, which is part of “general intelligence”, but we pick up values just as easily and using the same circuitry. Where does the symmetry break?
I do think alignment has a relatively-simple core. Not as simple as intelligence/competence, since there’s a decent number of human-value-specific bits which need to be hardcoded (as they are in humans), but not enough to drive the bulk of the asymmetry.
(BTW, I do think you’ve correctly identified an important point which I think a lot of people miss: humans internally “learn” values from a relatively-small chunk of hardcoded information. It should be possible in-principle to specify values with a relatively small set of hardcoded info, similar to the way humans do it; I’d guess fewer than at most 1000 things on the order of complexity of a very fuzzy face detector are required, and probably fewer than 100.)
The reason it’s less learnable than competence is not that alignment is much more complex, but that it’s harder to generate a robust reward signal for alignment. Basically any sufficiently-complex long-term reward signal should incentivize competence. But the vast majority of reward signals do not incentivize alignment. In particular, even if we have a reward signal which is “close” to incentivizing alignment in some sense, the actual-process-which-generates-the-reward-signal is likely to be at least as simple/natural as actual alignment.
(I’ll note that the departure from talking about Hidden Complexity here is mainly because competence in particular is a special case where “complexity” plays almost no role, since it’s incentivized by almost any reward. Hidden Complexity is still usually the right tool for talking about why any particular reward-signal will not incentivize alignment.)
I suspect that Eliezer’s answer to this would be different, and I don’t have a good guess what it would be.
Thinking about it more, it seems that messy reward signals will lead to some approximation of alignment that works while the agent has low power compared to its “teachers”, but at high power it will do something strange and maybe harm the “teachers” values. That holds true for humans gaining a lot of power and going against evolutionary values (“superstimuli”), and for individual humans gaining a lot of power and going against societal values (“power corrupts”), so it’s probably true for AI as well. The worrying thing is that high power by itself seems sufficient for the change, for example if an AI gets good at real-world planning, that constitutes power and therefore danger. And there don’t seem to be any natural counterexamples. So yeah, I’m updating toward your view on this.
Speaking for myself here…
OK, let’s say we want an AI to make a “nanobot plan”. I’ll leave aside the possibility of other humans getting access to a similar AI as mine. Then there are two types of accident risk that I need to worry about.
First, I need to worry that the AI may run for a while, then hand me a plan, and it looks like a nanobot plan, but it’s not, it’s a booby trap. To avoid (or at least minimize) that problem, we need to be confident that the AI is actually trying to make a nanobot plan—i.e., we need to solve the whole alignment problem.
Alternatively, maybe we’re able to thoroughly understand the plan once we see it; we’re just too stupid to come up with it ourselves. That seems awfully fraught—I’m not sure how we could be so confident that we can tell apart nanobot plans from booby-trap plans. But let’s assume that’s possible for the sake of argument, and then move on to the other type of accident risk:
Second, I need to worry that the AI will start running, and I think it’s coming up with a nanobot plan, but actually it’s hacking its way out of its box and taking over the world.
How and why might that happen?
I would say that if a nanobot plan is very hard to create—requiring new insights etc.—then the only way to do it is to create the nanobot plan is to construct an agent-like thing that is trying to create the nanobot plan.
The agent-like thing would have some kind of action space (e.g. it can choose to summon a particular journal article to re-read, or it can choose to think through a certain possibility, etc.), and it would have some kind of capability of searching for and executing plans (specifically, plans-for-how-to-create-the-nanobot-plan), and it would have a capability of creating and executing instrumental subgoals (e.g. go on a side-quest to better understand boron chemistry) and plausibly it needs some kind of metacognition to improve its ability to find subgoals and take actions.
Everything I mentioned is an “internal” plan or an “internal” action or an “internal” goal, not involving “reaching out into the world” with actuators and internet access and nanobots etc.
If only the AI would stick to such “internal” consequentialist actions (e.g. “I will read this article to better understand boron chemistry”) and not engage in any “external” consequentialist actions (e.g. “I will seize more computer power to better understand boron chemistry”), well then we would have nothing to worry about! Alas, so far as I know, nobody knows how to make a powerful AI agent that would definitely always stick to “internal” consequentialism.
Personally, I’d consider a Fusion Power Generator-like scenario a more central failure mode than either of these. It’s not about the difficulty of getting the AI to do what we asked, it’s about the difficulty of posing the problem in a way which actually captures what we want.
I agree that that is another failure mode. (And there are yet other failure modes too—e.g. instead of printing the nanobot plan, it prints “Help me I’m trapped in a box…” :-P . I apologize for sloppy wording that suggested the two things I mentioned were the only two problems.)
I disagree about “more central”. I think that’s basically a disagreement on the question of “what’s a bigger deal, inner misalignment or outer misalignment?” with you voting for “outer” and me voting for “inner, or maybe tie, I dunno”. But I’m not sure it’s a good use of time to try to hash out that disagreement. We need an alignment plan that solves all the problems simultaneously. Probably different alignment approaches will get stuck on different things.
Yes, I am reading here too that Eliezer seems to be making a stronger point, specifically one related to corrigibility.
Looks like Eliezer believes that (or in Bayesian terms, assigns a high probability to the belief that) corrigibility has not been solved for AGI. He believes it has not been solved for any practically useful value of solved. Furthermore it looks like he expects that progress on solving AGI corrigibility will be slower than progress on creating potentially world-ending AGI. If Eliezer believed that AGI corrigibility had been solved or was close to being solved, I expect he would be in a less dark place than depicted, that he would not be predicting that stolen/leaked AGI code will inevitably doom us when some moron turns it up to 11.
In the transcript above, Eliezer devotes significant space to explaining why he believes that all corrigibility solutions being contemplated now will likely not work. Some choice quotations from the end of the transcript:
this is where things get somewhat personal for me:
I am one of `these people outside MIRI’ who have published papers and sequences saying that they have solved large chunks of the AGI corrigibility problem.
I have never been claiming that I ‘totally just solved corrigibility’. I am not sure where Eliezer is finding these ‘totally solved’ people, so I will just ignore that bit and treat it as a rhetorical flourish. But I have indeed been claiming that significant progress has been made on AGI corrigibility in the last few years. In particular, especially in the sequence, I implicitly claim that viewpoints have been developed, outside of MIRI, that address and resolve some of MIRIs main concerns about corrigibility. They resolve these in part by moving beyond Eliezer’s impoverished view of what an AGI-level intelligence is, or must be.
Historical note: around 2019 I spent some time trying to get Eliezier/MIRI interested in updating their viewpoints on how easy or hard corrigibility was. They showed no interest to engage at that time, I have since stopped trying. I do not expect that anything I will say here will update Eliezer, my main motivation to write here is to inform and update others.
I will now point out a probable point of agreement between Eliezer and me. Eliezer says above that corrigibility is a property that is contradictory to having a powerful coherent AGI-level plan generator. Here, coherency has something to do with satisfying a bunch of theorems about how a game-theoretically rational utility maximiser must behave when making plans. One of these theorems is that coherence implies an emergent drive towards self-preservation.
I generally agree with Eliezer that there is a indeed a contradiction here: there is a contradiction between broadly held ideas of what it implies for an AGI to be a coherent utility maximising planner, and broadly held ideas of what it implies for an AGI to be corrigible.
I very much disagree with Eliezier on how hard it is to resolve these contradictions. These contradictions about corrigibility are easy to resolve one you abandon the idea that every AGI must necessarily satisfy various theorems about coherency. Human intelligence definitely does not satisfy various theorems about coherency. Almost all currently implemented AI systems do not satisfy some theorems about coherency, because they will not resist you pressing their off switch.
So this is why I call Eliezer’s view of AGI an impoverished view: Eliezer (at least in the discussion transcript above, and generally whenever I read his stuff) always takes it as axiomatic that an AGI must satisfy certain coherence theorems. Once you take that as axiomatic, it is indeed easy to develop some rather negative opinions about how good other people’s solutions to corrigibility are. Any claimed solution can easily be shown to violate at least one axiom you hold dear. You don’t even need to examine the details of the proposed solution to draw that conclusion.
Various previous proposals for utility indifference have foundered on gotchas like “Well, if we set it up this way, that’s actually just equivalent to the AI assigning probability 0 to the shutdown button ever being pressed, which means that it’ll tend to design the useless button out of itself.” Or, “This AI behaves like the shutdown button gets pressed with a fixed nonzero probability, which means that if, say, that fixed probability is 10%, the AI has an incentive to strongly precommit to making the shutdown button get pressed in cases where the universe doesn’t allow perpetual motion, because that way there’s a nearly 90% probability of perpetual motion being possible.” This tends to be the kind of gotcha you run into, if you try to violate coherence principles; though of course the real and deeper problem is that I expect things contrary to the core of general intelligence to fail to generalize when we try to scale AGI from the safe domains in which feedback can be safely provided, to the unsafe domains in which bad outputs kill the operators before they can label the results.
It’s all very well and good to say “It’s easy to build an AI that believes 2 + 2 = 5 once you relax the coherence constraints of arithmetic!” But the whole central problem is that we have to train an AI when it’s operating in an intrinsically safe domain and intrinsically safe intelligence level where it couldn’t kill the operators if it tried, and then scale that AI to produce outputs in dangerous domains like “Please build a nanosystem”; and if you build a dumb AI that thinks 2 + 2 = 5, and then make it much more intelligent, I strongly suspect that it snaps into in some sense ‘knowing’ or ‘realizing’ or ‘starting to act in important ways as if’ 2 + 2 = 4.
Can you boil down your attempted solution to a statement of which coherence constraint you think it is possible to violate, and what shutdownable behavior you think can result from that, how, in the least fancy actually-precise terms of which your writing talent permits? And then I can say whether my answer is “That doesn’t do what you think” or “Sorry, that coherence violation is a large enough ask that I think any trained patch for it probably fails to scale with general intelligence” or “That does seem non-self-destructive in a certain generalized sense, and the shutdownability might work and scale if we could magically conjure the underlying pattern you posit, though I don’t see how you could get it into Mu Zero per se by being clever with loss functions.” Where the third option there is the one that crosses over into “Well golly that was a large advance compared to the state we occupied in 2015.”
Glad you asked.
If you want actual full precision, I have to refer you to the math in my papers. Since 2019 I have been working on and off to make this math more general and accessible, and to find better ways to explain this math in actually-precise natural language statements. So here is my best current attempt.
TL;DR: The coherence constraint/theorem/property that I want to violate is the property that the AGI is using a world model that accurately depicts the internals of its own compute core. I want to make the agent use an inaccurate model of these internals, one that omits the safety shut-down mechanisms I have built inside of its core. By doing this, I can reduce the emergent incentive of the AGI agent to disable its own emergency shut-down mechanisms, and the emergent incentive to stop people from activating them.
I will now expand on this and add more details, using the the example of an emergency stop button.
Say that the compute core has an emergency stop button attached to it. Say that actual software running inside the compute core will, when receiving a stop signal from the button, cause the agent to stop. When the signal is received, the software will always select and perform null actions in every future time step. Let’s say that Mc is a world model that accurately depicts this situation. I am not going to build an AGI that uses Mc to plan its actions.
Instead I build an AGI agent that will plan its next actions by using an incorrect world model Mi. This Mi is different from Mc, but only in how it depicts the internals of the agent compute core. In the incorrect/imaginary world depicted by Mi, the compute core has different software in it, software that will ignore the stop button signal, and just keep on picking actions that maximize utility. I further construct my AGI so that, in every time step, it calculates which next action a would maximize utility in this incorrect, imaginary world Mi. I then further construct it to take this same action a in the real world.
An AGI that lives in a world that matches the correct model Mc, while using the incorrect model Mi to decide what actions to take, will (sometimes) act incoherently. But this is a very specific and targeted form of incoherence, far less severe and far less dangerous than having the agent believe that 2+2=5. Still, this incoherency does have some unwanted side effects. One example side effect, also considered in the 2015 MIRI corrigibility paper, is that the AGI will be indifferent about the fate of its emergency stop button. In my papers, I explore a much broader range of side effects, and look at how to prevent/suppress/fix them.
So the first step in the above corrigibility design is that you have to imagine constructing an AGI that will use a specifically incorrect world model Mi. The first step is that you have to realize that you actually want this, that you want to decrease coherence in order to increase corrigibility.
The next step is to look for options to actually construct an AGI that uses this incorrect Mi.
One option is to train the AGI in a simulated environment that faithfully represents Mi. We then extract the learned world model, and put it in an AI that we unleash in the real world. We disable further learning in this AI, to prevent Mi from being updated towards Mc. Though this option depicts exactly how many current real-life AI systems are constructed and get their corrigibility, this option is not very satisfactory in a theoretical sense. It may not be a viable path to build certain otherwise desirable types of AI/AGI/superintelligence.
The more interesting option is to build an AGI that constructs Mi via an online training regime in the real world, a world that exactly matches the correct world model Mc.
Is the second option technically feasible? The answer is generally yes.
A very general solution, leveraging causal models, is in this sequence. To give you some idea of the more detailed math involved: the (ls) pictured below is the real/correct world model Mc, and (p) is the imaginary/incorrect world model Mi. In the agent defined using these pictures, certain indifference properties are present trivially, by construction.
The solution in the pictures above uses ML to learn a model L that correctly models only a part of the real world, the part of the real world outside of the agent compute core. We then slot L into a causal world model, one that has other parts which mis-represent the real compute core internals. This causal world model will then be used as the incorrect Mi. Note that this learned L may be a fully opaque deep neural net, or whatever black-box thing the next generation of ML might produce. We do not have to understand or edit any details inside L for this slot-in operation to work.
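To illustrate the slot-in operation, here is a rough sketch, again with hypothetical names; the real construction is the causal diagram in the sequence/paper, and this only shows the division of labor between the learned part and the hand-written part.

```python
# Rough sketch of the slot-in idea (hypothetical names; the real construction is the causal
# diagram in the sequence/paper). The opaque learned model L predicts only the world outside
# the compute core; the core-internal part of Mi is written by hand and deliberately omits
# the shut-down mechanism.

def make_Mi(learned_L):
    """Wrap the black-box learned model L into the incorrect world model Mi."""
    def predict(world_state, core_action):
        # Outside-the-core dynamics: delegated entirely to the opaque learned model L.
        next_world_state = learned_L.predict(world_state, core_action)
        # Inside-the-core dynamics, as (mis)depicted by Mi: the core is always 'running'
        # and keeps maximizing utility, whatever the stop button does in the real world.
        next_core_state = "running"
        return next_world_state, next_core_state
    return predict
```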
(I should mention that the paper has a lot of details not mentioned in the sequence, or visible in the pictures above. In particular, section 10.2 may be of interest.)
I want to stress that this causal model option is only one possible route to creating incorrect world models Mi via machine learning in the real world. Papers like Safely interruptible agents and How RL Agents Behave When Their Actions Are Modified show that the idea of removing certain events from the training record can also work: whether this works as intended depends on having the right built-in priors, priors which control inductive generalization.
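As a minimal illustration of the remove-events idea (my own toy rendering, not the construction in either cited paper): strip the post-interruption transitions out of the observational record before the world model is trained on it; whether the learner then fills that gap as intended is exactly the built-in-priors question.

```python
# Toy rendering of the remove-events idea (mine, not the construction in either cited paper).
# Transitions recorded while a stop/interrupt signal was active never reach the learner, so
# the learned world model contains no information about what interruption does. How the
# learner generalizes over that gap depends on its built-in priors.

def strip_interruptions(logged_transitions):
    """Keep only (s, a, s') transitions recorded with no interrupt signal active."""
    return [(s, a, s2) for (s, a, s2, interrupted) in logged_transitions if not interrupted]

# world_model = train_world_model(strip_interruptions(logged_transitions))  # assumed trainer
```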
So overall, I have a degree of optimism about AGI corrigibility.
That being said, if you want to map out and estimate probabilities for our possible routes to doom, then you definitely need to include the scenario where a future superior-to-everything-else type of ML is invented, where this superior future type of ML just happens to be incompatible with any of the corrigibility techniques known at that time. Based on the above work, I put a fairly low probability on that scenario.
Apparently no one has actually shown that corrigibility can be VNM-incoherent in any precise sense (and not in the hand-wavy sense which is good for intuition-pumping). I went ahead and sketched out a simple proof of how a reasonable kind of corrigibility gives rise to formal VNM incoherence.
I’m interested in hearing about how your approach handles this environment, because I think I’m getting lost in informal assumptions and symbol-grounding issues when reading about your proposed method.
Read your post, here are my initial impressions on how it relates to the discussion here.
In your post, you aim to develop a crisp mathematical definition of (in)coherence, i.e. VNM-incoherence. I like that, looks like a good way to move forward. Definitely, developing the math further has been my own approach to de-confusing certain intuitive notions about what should be possible or not with corrigibility.
However, my first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the term incoherence. In my view, the four axioms of VNM-rationality have only very weak descriptive and constraining power when it comes to defining rational behavior. I believe that Eliezer’s notion of rationality, and therefore his notion of coherence above, goes far beyond what is implied by the axioms of VNM-rationality. My feeling is that Eliezer is using the term ‘coherence constraints’ in an intuition-pump way, where coherence implies, or almost always implies, that a coherent agent will develop the incentive to self-preserve.
Looking at your post, I am also having trouble telling exactly how you are defining VNM-incoherence. You seem to be toying with several alternative definitions, one where it applies to reward functions (or preferences over lotteries) which are only allowed to examine the final state in a 10-step trajectory, another where the reward function can examine the entire trajectory and maybe the actions taken to produce that trajectory. I think that your proof only works in the first case, but fails in the second case. This has certain (fairly trivial) corollaries about building corrigibility. I’ll expand on this in a comment I plan to attach to your post.
I think one way to connect your ABC toy environment to my approach is to look at sections 3 and 4 of my earlier paper where I develop a somewhat similar clarifying toy environment, with running code.
Another comment I can make is that your ABC nodes-and-arrows state transition diagram is a depiction which makes it hard to see how to apply my approach, because the depiction mashes up the state of the world outside the compute core and the state of the world inside the compute core. If you want to apply counterfactual planning, or if you want to have an agent design that can compute the balancing function terms according to Armstrong’s indifference approach, you need a different depiction of your setup, one which separates out these two state components more explicitly. For example, make an MDP model where the individual states are instances of the tuple (physical position of the agent in the ABC playing field, policy function loaded into the compute core).
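Here is a minimal sketch of the kind of re-depiction I mean (toy names, not your exact ABC setup): each MDP state explicitly pairs the outside-the-core component with the inside-the-core component, so self-modification of the core shows up as an ordinary state transition.

```python
# Minimal sketch of the suggested re-depiction (toy names, not the exact ABC setup).
# Each MDP state explicitly pairs the world outside the compute core with the world
# inside it.

from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    position: str        # outside the core: the agent's square on the ABC playing field
    loaded_policy: str   # inside the core: name of the policy function currently installed

def transition(state: State, action: str) -> State:
    """Toy deterministic transition function over the two separated state components."""
    if action.startswith("move_"):                 # ordinary movement in the playing field
        return State(action.removeprefix("move_"), state.loaded_policy)
    if action.startswith("install_"):              # the core rewriting its own software
        return State(state.position, action.removeprefix("install_"))
    return state                                   # null action
```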
Not sure how to interpret your statement that you got lost in symbol-grounding issues. If you can expand on this, I might be able to help.
Update: I just recalled that Eliezer and MIRI often talk about Dutch booking when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here.
When it comes to Dutch booking as a coherence criterion, I need to repeat again the observation I made below:
To extend this to Dutch booking: if you train a superintelligent poker playing agent with a reward function that rewards it for losing at poker, you will find that it can be Dutch booked rather easily, if your Dutch booking test is whether you can find a counter-strategy that makes it lose money.
I haven’t read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. If it can’t learn at this point then I find it hard to believe it’s generally capable, and if it can, it will have incentive to simply remove the device or create a copy of itself that is correct about its own world model. Do you address this in the articles?
On the other hand, this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.
Yes, I address this; see for example the part about The possibility of learned self-knowledge in the sequence. I show there that any RL agent, even a non-AGI, will always have the latent ability to ‘look at itself’ and create a machine-learned model of its compute core internals.
What is done with this latent ability is up to the designer. The key thing here is that you have a choice as a designer: you can decide whether you want to design an agent which indeed uses this latent ability to ‘look at itself’.
Once you decide that you don’t want to use this latent ability, certain safety/corrigibility problems become a lot more tractable.
Wikipedia has the following definition of AGI:
Though there is plenty of discussion on this forum which silently assumes otherwise, there is no law of nature which says that, when I build a useful AGI-level AI, I must necessarily create the entire package of all human cognitive abilities inside of it.
Terminology note if you want to look into this some more: ML typically does not frame this goal as ‘instructing the model not to learn about Q’. ML would frame this as ‘building the model to approximate the specific relation P(X|Y,Z) between some well-defined observables, and this relation is definitely not Q’.
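In code, the framing difference looks roughly like this (illustrative names, and a generic supervised learner standing in for whatever ML you use): the learner is simply never given Q as an input or a target, rather than being ‘told’ to avoid it.

```python
# Illustrative sketch of the ML framing above: we build a model that approximates
# P(X | Y, Z) between well-defined observables. Q is simply not among the inputs or
# targets, so there is nothing to 'instruct' the model not to learn.

from sklearn.linear_model import LogisticRegression

def fit_conditional_model(records):
    """records: dicts containing observables 'x', 'y', 'z' (and possibly much more)."""
    inputs = [[r["y"], r["z"]] for r in records]   # only Y and Z are exposed to the learner
    targets = [r["x"] for r in records]            # the model approximates P(X | Y, Z), nothing else
    return LogisticRegression().fit(inputs, targets)
```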
If you don’t wish to reply to Eliezer, I’m an other and also ask what incoherence allows what corrigibility. I expect counterfactual planning to fail for want of basic interpretability. It would also coherently plan about the planning world—my Eliezer says we might as well equivalently assume superintelligent musings about agency to drive human readers mad.
See above for my reply to Eliezer.
Indeed, a counterfactual planner will plan coherently inside its planning world.
In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don’t see that fact mentioned often on this forum, so I will expand.
An agent that plans coherently given a reward function Rp to maximize paperclips will be an incoherent planner if you judge its actions by a reward function Rs that values the maximization of staples instead. In section 6.3 of the paper I show that you can perfectly well interpret a counterfactual planner as an agent that plans coherently even inside its learning world (inside the real world), as long as you are willing to evaluate its coherency according to the somewhat strange reward function Rπ. Armstrong’s indifference methods use this approach to create corrigibility without losing coherency: they construct an equivalent somewhat strange reward function by including balancing terms.
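A tiny worked example of this relativity (made-up numbers): the same choice scores as coherent when judged by Rp and as incoherent when judged by Rs.

```python
# Tiny worked example (made-up numbers): the same choice is coherent under the reward
# function Rp that values paperclips, and incoherent under Rs that values staples.

def total_reward(trajectory, reward):
    return sum(reward(step) for step in trajectory)

R_p = lambda step: step["paperclips"]   # Rp: values paperclip production
R_s = lambda step: step["staples"]      # Rs: values staple production

chosen   = [{"paperclips": 5, "staples": 0}]   # what the paperclip maximizer actually does
rejected = [{"paperclips": 0, "staples": 5}]   # an alternative it passed up

assert total_reward(chosen, R_p) > total_reward(rejected, R_p)  # coherent by Rp's lights
assert total_reward(chosen, R_s) < total_reward(rejected, R_s)  # incoherent by Rs's lights
```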
One thing I like about counterfactual planning is that, in my view, it is very interpretable to humans. Humans are very good at predicting what other humans will do, when these other humans are planning coherently inside a specifically incorrect world model, for example in a world model where global warming is a hoax. The same skill can also be applied to interpreting and anticipating the actions of AIs which are counterfactual planners. But maybe I am misunderstanding your concern about interpretability.
Misunderstanding: I expect we can’t construct a counterfactual planner because we can’t pick out the compute core in the black-box learned model.
And my Eliezer’s problem with counterfactual planning is that the plan may start by unleashing a dozen memetic, biological, technological, magical, political and/or untyped existential hazards on the world which then may not even be coordinated correctly when one of your safeguards takes out one of the resulting silicon entities.
Agree it is hard to pick the compute core out of a black-box learned model that includes the compute core.
But one important point I am trying to make in the counterfactual planning sequence/paper is that you do not have to solve that problem. I show that it is tractable to route around it, and still get an AGI.
I don’t understand your second paragraph ‘And my Eliezer’s problem...’. Can you unpack this a bit more? Do you mean that counterfactual planning does not automatically solve the problem of cleaning up an already in-progress mess when you press the emergency stop button too late? It does not intend to, and I do not think that the cleanup issue is among the corrigibility-related problems Eliezer has been emphasizing in the discussion above.
Oh, I wasn’t expecting you to have addressed the issue! 10.2.4 says L wouldn’t be S if it were calculated from projected actions instead of given actions. How so? Mightn’t it predict the given actions correctly?
You’re right on all counts in your last paragraph.
Not sure if a short answer will help, so I will write a long one.
In 10.2.4 I talk about the possibility of an unwanted learned predictive function L−(s′,s,a) that makes predictions without using the argument a. This is possible, for example, by using s together with a (learned) model πl of the compute core to predict a: a viable L− could then be defined as L−(s′,s,a)=S(s′,s,πl(s)). This L− could make predictions fully compatible with the observational record o, but I claim it would not be a reasonable learned L according to the reasonableness criterion L≈S. How so?
The reasonableness criterion L≈S is similar to that used in supervised machine learning: we evaluate the learned L not primarily by how well it matches the training set (how well it predicts the observations in o), but by how well it performs on a separate test set. This test set can be constructed by sampling S to create samples not contained in o. Mathematically, perfect reasonableness is defined as L=S, which implies that L predicts all samples from S fully accurately.
Philosophically/ontologically speaking, the agent specification in my paper, specifically the learning world diagram and the descriptive text around it explaining how this diagram is a model of reality, gives the engineer an unambiguous prescription for how they might build experimental equipment that can measure the properties of the S in the learning world diagram by sampling reality. A version of this equipment must of course be built into the agent, to create the observations that drive machine learning of L, but another version can be used stand-alone to construct a test set.
A sampling action to construct a member of the test set would set up a desired state s and action a, and then observe the resulting s′. Mathematically speaking, this observation gives additional information about the numeric value of S(s′,s,a), and of S(s′′,s,a) for every s′′≠s′.
I discuss in the section that, if we take an observational record o sampled from S, then two learned predictive functions L1 and L2 could be found which are both fully compatible with all observations in o. So to determine which one might be a more reasonable approximation of S, we can see how well they would each predict samples not yet in o.
In the case of section 10.2.4, the crucial experimental test showing that L− is an unreasonable approximation of S is one where we create a test set by setting up an st and an at where we know that at is an action that would definitely not be taken by the real compute core software running in the agent when it encounters state st. So we set up a test where we expect that at≠πl(st). L− will (likely) mis-predict the outcome of this test. In philosophical/ontological terms, you can read this test as one that (likely) falsifies the claim that L− is a correct theory of S.
As discussed in section 10.2.4, there are parallels between the above rejection test and the idea of random exploration, where random exploration causes the observational record o, the training set, to already contain observations where at≠πl(st) for any deterministic πl. So this will likely suppress the creation of an unwanted L− via machine learning.
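Here is a deterministic toy version of that rejection test (made-up dynamics; in the paper S and L are probability-valued): both predictors fit the on-policy record o perfectly, and only an off-policy test sample with at≠πl(st) separates them.

```python
# Deterministic toy version of the rejection test (made-up dynamics; in the paper S and L
# are probability-valued). Both learned predictors reproduce the on-policy observational
# record o, but only an off-policy test sample with a != pi_l(s) separates them.

def S(s, a):          # true next-state dynamics: depends on the action actually taken
    return s + a

def pi_l(s):          # learned model of the core's deterministic policy
    return 1

def L_good(s, a):     # reasonable learned model: actually uses the action argument a
    return s + a

def L_minus(s, a):    # unwanted model: ignores a and substitutes the policy's action
    return S(s, pi_l(s))

# On-policy training data (a = pi_l(s) everywhere): both models fit it perfectly.
for s in range(5):
    assert L_good(s, pi_l(s)) == S(s, pi_l(s)) == L_minus(s, pi_l(s))

# Off-policy test sample with a != pi_l(s): L_minus mis-predicts and gets rejected.
s_t, a_t = 3, 0
assert L_good(s_t, a_t) == S(s_t, a_t)
assert L_minus(s_t, a_t) != S(s_t, a_t)
```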
Some background: the symbol grounding issue I discuss in 10.2.4 is closely related to the five-and-ten problem you can find in MIRI’s work on embedded agency. In my experience, most people in AI, robotics, statistics, or cyber-physical systems have no problem seeing the solution to this five-and-ten problem, i.e. how to construct an agent that avoids it. But somehow, and I do not know exactly why, MIRI-style(?) Rationalists keep treating it as a major open philosophical problem that is ignored by the mainstream AI/academic community. So you can read section 10.2.4 as my attempt to review and explain the standard solution to the five-and-ten problem, as used in statistics and engineering. The section was partly written with Rationalist readers in mind.
Philosophically speaking, the reasonableness criterion defined in my paper, and by supervised machine learning, has strong ties to Popper’s view of science and engineering, which emphasizes falsification via new experiments as the key method for deciding between competing theories about the nature of reality. I believe that MIRI-style rationality de-emphasizes the conceptual tools provided by Popper. Instead it emphasizes a version of Bayesianism that provides a much more limited vocabulary to reason about differences between the map and the territory.
I would be interested to know if the above explanation was helpful to you, and if so which parts.
+1 to the question.
My current best guess at an answer:
There are easy safe ways, but not easy safe useful-enough ways. E.g. you could make your AI output DNA strings for a nanosystem and absolutely not synthesize them, just have human scientists study them, and that would be a perfectly safe way to develop nanosystems in, say, 20 years instead of 50, except that you won’t make it 2 years without some fool synthesizing the strings and ending the world. And more generally, any pathway that relies on humans achieving deep understanding of the pivotal act will take more than 2 years, unless you make ‘human understanding’ one of the AI’s goals, in which case the AI is optimizing human brains and you’ve lost safety.