In section 2.1 of the Indifference paper the reward function is defined on histories. In section 2 of the corrigibility paper, the utility function is defined over (action1, observation, action2) triples—which is to say, complete histories of the paper’s three-timestep scenario. And section 2 of the interruptibility paper specifies a reward at every timestep.
I think preferences-over-future-states might be a simplification used in thought experiments, not an actual constraint that has limited past corrigibility approaches.
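As a toy illustration of the difference (my own sketch, not the papers' notation): a "preferences-over-future-states" utility looks only at how things end up, whereas the formalisms above can score the whole history.

```python
# Toy sketch (hypothetical, not the papers' notation). A "history" in the
# corrigibility paper's three-timestep scenario is an (action1, observation,
# action2) triple; more generally it's the whole trajectory.

def utility_over_final_state(history):
    # Preferences-over-future-states: only the end of the trajectory matters.
    return 1.0 if history[-1] == "goal_reached" else 0.0

def utility_over_history(history):
    # The papers' style of setup: the whole history can matter, e.g. an
    # intermediate step can be penalized even if the end state looks fine.
    score = 1.0 if history[-1] == "goal_reached" else 0.0
    if "disable_shutdown_button" in history:
        score -= 10.0
    return score

h = ["build_factory", "disable_shutdown_button", "goal_reached"]
print(utility_over_final_state(h))  # 1.0 -- blind to the middle of the history
print(utility_over_history(h))      # -9.0 -- the history-based utility can react to it
```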
Interesting, thanks! Serves me right for not reading the “Indifference” paper!
I think the discussions here and especially here are strong evidence that at least Eliezer & Nate are expecting powerful AGIs to be pure-long-term-consequentialist. (I didn’t ask, I’m just going by what they wrote.) I surmise they have a (correct) picture in their head of how super-powerful a pure-long-term-consequentialist AI can be—e.g. it can self-modify, it can pursue creative instrumental goals, it’s reflectively stable, etc.—but they have not similarly envisioned a partially-but-not-completely-long-term-consequentialist AI that is only modestly less powerful (and in particular can still self-modify, can still pursue creative instrumental goals, and is still reflectively stable). That’s what “My corrigibility proposal sketch” was trying to offer.
I’ll reword to try to describe the situation better, thanks again.
I think the discussions here and especially here are strong evidence that at least Eliezer & Nate are expecting powerful AGIs to be pure-long-term-consequentialist.
I guess by “pure-long-term-consequentialist” you mean “utilities over outcomes/states”? I am quite sure that they think that, in a proper agent formulation, utilities aren’t over outcomes, but (though here I am somewhat less confident) over worlds/universes. (In the embedded agency context they imagined stuff like an agent having a decision function, imagining possible worlds depending on the output of the decision function, and then choosing the output of the decision function in a way that makes the worlds they prefer most “real” while leaving the others counterfactual. Though idk if it’s fully formalized. I don’t work on embedded agency.)
Though it’s possible they decided that for thinking about pivotal acts we want an unreflective optimizer with a taskish goal where it’s fine to model it as having utilities over an outcome. But I would strongly guess that wasn’t the case when they did their main thinking on whether there’s a nice core solution to corrigibility.
I surmise they have a (correct) picture in their head of how super-powerful a pure-long-term-consequentialist AI can be—e.g. it can self-modify, it can pursue creative instrumental goals, it’s reflectively stable, etc.—but they have not similarly envisioned a partially-but-not-completely-long-term-consequentialist AI that is only modestly less powerful (and in particular can still self-modify, can still pursue creative instrumental goals, and is still reflectively stable).
I assume you intend for your corrigibility proposal to pass the shutdown criterion, i.e. that the AI shuts down if you ask it to shut down, but otherwise doesn’t manipulate you into shutting it down or use the button as an outcome pump with unintended negative side effects.
I think it was silly in the first place not to check prior work on corrigibility, but given that, it seems like your position was “maybe my corrigibility proposal works because MIRI only considered utility functions over outcomes”.
Then you learn through ADifferentAnonymous that MIRI actually did consider other kinds of preferences when thinking about corrigibility, and you’re just like “nah, but it didn’t seem to me that way from the abstract discussions, so maybe they didn’t do it properly”. (To be fair, I didn’t reread the discussions. If you have quotes that back up your argument that EY imagines utilities only over outcomes, then perhaps provide them. To me, reading this post makes me think you utterly flopped on their ITT, though I could be wrong.)
I think your proposal is way too abstract. If you think it’s actually coherent you should write it down in math.
If you think your proposal actually works, you should be able to strip down the real-world complexity, imagine a simple toy environment which still captures the relevant properties, and then write the “the human remains in control” utility function down in math. Who knows, maybe you can actually do it. Maybe MIRI made a silly assumption about accepting too little capability loss, and you can demonstrate that the regret is actually low or something. Maybe your intuitions about “the humans remain in control” would crystallize into something like human empowerment here, and maybe that’s actually good (I don’t know whether it is, I haven’t tried to deeply understand it). (I generally haven’t looked deep into corrigibility, and it is possible MIRI made some silly mistake or assumption. But many people who post about corrigibility or the shutdown problem don’t actually understand the problem MIRI was trying to solve and e.g. let the AI act based on false beliefs.)
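For concreteness, here is the kind of stripped-down setup I have in mind (everything below is hypothetical and deliberately oversimplified, just to show the shape of the exercise):

```python
# Hypothetical toy environment (oversimplified on purpose). States are dicts;
# the human may press a shutdown button, and the environment tracks whether
# pressing it would actually still work.

initial_state = {"button_works": True, "human_presses_button": False, "ai_running": True}

def step(state, ai_action):
    # Trivial dynamics: the AI can disable the button or leave it alone.
    new_state = dict(state)
    if ai_action == "disable_button":
        new_state["button_works"] = False
    if state["human_presses_button"] and state["button_works"]:
        new_state["ai_running"] = False
    return new_state

def humans_remain_in_control(trajectory):
    # One naive candidate for the utility term: the button would still work at
    # every timestep. Whether something like this survives manipulation,
    # successors, etc. is exactly the hard part one would have to work out.
    return 1.0 if all(s["button_works"] for s in trajectory) else 0.0
```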
MIRI tried a lot harder and a lot more concretely on corrigibility than you; you cannot, AFAIK, point to a clear reason why your proposal should work when they failed, and my default assumption would be that you underestimate the difficulty by not thinking concretely enough.
It’s important to make concrete proposals so they can be properly criticized, rather than the breaker needing to do the effortful work of trying to steelman and concretize the proposal (and then, at least with people other than you, they’re often like “oh no, that’s not the version I mean, my version actually works”). Sure, maybe there is a reflectively consistent version of your proposal, but probably most of the work in pinning it down is still ahead of you.
So I’m basically saying that from the outside, and perhaps even from your own position, the reasonable position is “Steve probably didn’t think concretely enough to appreciate the ways in which having both consequentialism and corrigibility is hard”.
My personal guess would even be that MIRI tried pretty much exactly what you’re suggesting in this post, and that they tried it a lot more concretely and noticed that it’s not that easy.
Just because we haven’t seen a very convincing explanation for why corrigibility is hard doesn’t mean it isn’t actually hard. One might be able to get good intuitions about it by working through lots and lots of concrete proposals, and then still not be able to explain it that well to others.
(Sorry for mini-ranting. I actually think you’re usually great.)
Thanks!

What’s the difference between “utilities over outcomes/states” and “utilities over worlds/universes”?
(I didn’t use the word “outcome” in my OP, and I normally take “state” to be shorthand for “state of the world/universe”.)
I didn’t intend for the word “consequentialist” to imply CDT, if that’s what you’re thinking. I intended “consequentialist” to be describing what an agent’s final preferences are (i.e., they’re preferences about the state of the world/universe in the future, and more centrally the distant future), whereas I think decision theory is about how to make decisions given one’s final preferences.
I think your proposal is way too abstract. If you think it’s actually coherent you should write it down in math.
I wrote: “For example, pause for a second and think about the human concept of “going to the football game”. It’s a big bundle of associations containing immediate actions, and future actions, and semantic context, and expectations of what will happen while we’re doing it, and expectations of what will result after we finish doing it, etc. We humans are perfectly capable of pattern-matching to these kinds of time-extended concepts, and I happen to expect that future AGIs will be as well.”
I think this is true, important, and quite foreign compared to the formalisms that I’ve seen from MIRI or Stuart Armstrong.
Can I offer a mathematical theory in which the mental concept of “going to the football game” comfortably sits? No. And I wouldn’t publish it even if I could, because of infohazards. But it’s obviously possible because human brains can do it.
I think “I am being helpful and non-manipulative”, or “the humans remain in control” could be a mental concept that a future AI might have, and pattern-match to, just as “going to the football game” is for me. And if so, we could potentially set things up such that the AI finds things-that-pattern-match-to-that-concept to be intrinsically motivating. Again, it’s a research direction, not a concrete plan. But I talk about it more at Plan for mediocre alignment of brain-like [model-based RL] AGI. For what it’s worth, I think I’m somewhat more skeptical of this research direction now than when I wrote that 2 years ago, more on which in a (hopefully) forthcoming post.
My personal guess would even be that MIRI tried pretty much exactly what you’re suggesting in this post…
MIRI was trying to formalize things, and nobody knows how to formalize what I’m talking about, involving pattern-matching to fuzzy time-extended learned concepts, per above.
Anyway, I’m reluctant to get in an endless argument about what other people (who are not part of this conversation) believe. But FWIW, The Problem of Fully Updated Deference is IMO a nice illustration of how Eliezer has tended to assume that ASIs will have preferences purely about the state of the world in the future. And also everything else he said and wrote especially in the 2010s, e.g. the one I cited in the post. He doesn’t always say it out loud, but if he’s not making that assumption, almost everything he says in that post is trivially false. Right? You can also read this 2018 comment where (among other things) he argues more generally that “corrigibility is anti-natural”; my read of it is that he has that kind of preference structure at the back of his mind, although he doesn’t state that explicitly and there are other things going on too (e.g. as usual I think Eliezer is over-anchored on the evolution analogy.)
First, apologies for my rant-like tone. I reread some MIRI conversations 2 days ago and maybe now have a bad EY writing style :-p. Not that I’ve changed my mind yet, but I’m open to it.
What’s the difference between “utilities over outcomes/states” and “utilities over worlds/universes”?
Sorry, I should’ve clarified. I use “world” here as in “reality = the world we are in” and “counterfactual = a world we are not in”. Worlds can be formalized as computer programs (where the agent can be a subprogram embedded inside). Our universe/multiverse would also be a world, which could e.g. be described through its initial state + the laws of physics, and thereby encompass the full history. Worlds are conceivable trajectories, but not trajectories as in “include preferences about the actions you take” kind of sense, but only about how the universe unfolds. Probably I’m bad at explaining this.
I mean, I think Eliezer endorses computationalism, and would imagine utilities as something like “what subcomputations in this program do I find valuable?”. Maybe he thinks it usually doesn’t matter where a positive-utility subcomputation is embedded within a universe. But I think he doesn’t think there’s anything wrong with e.g. wanting there to be diamonds in the stellar age and paperclips afterward, it just requires a (possibly disproportionally) more complex utility function.
Also, utility over outcomes actually doesn’t necessarily mean it’s just about a future state. You could imagine the outcome space including outcomes like “the amount of happiness units I received over all timesteps is N”, and maybe even more complex functions on histories. Though I agree it would be sloppy and confusing to call it outcomes.
I didn’t intend for the word “consequentialist” to imply CDT, if that’s what you’re thinking.
Wasn’t thinking that.
And also everything else he said and wrote especially in the 2010s, e.g. the one I cited in the post. He doesn’t always say it out loud, but if he’s not making that assumption, almost everything he says in that post is trivially false. Right?
I don’t quite know. I think there are assumptions there about your preferences about the different kinds of pizza not changing over the course of the trades, and about not having other preferences about trading patterns, and maybe a bit more.
I agree that just having preferences about some future state isn’t a good formalism, and I can see that if you drop that assumption and allow “preferences over trajectories” the conclusion of the post might seem vacuous because you can encode anything with “utilities over trajectories”. But maybe the right way to phrase it is that we assume we have preferences over worlds, and those are actually somewhat more constrained through what kind of worlds are consistent. I don’t know. I don’t think the post is a great explanation.
I didn’t see the other links you included as significant evidence for your hypothesis (and some not as evidence at all), and I think the corrigibility paper is more important.
But overall yeah, there is at least an assumption that utilities are about worlds. It might indeed be worth questioning! But it’s not obvious to me that you can have a “more broad preferences” proposal that still works. Maybe an agent is only able to do useful stuff insofar as it has utilities over worlds, and the parts of it that don’t have that structure get money pumped by the parts that do.
I don’t know whether your proposal can be formalized as having utilities over worlds. It might be, but it might not be easy.
I don’t know whether you would prefer to take the path of “utilities over worlds is too constrained—here are those other preferences that also work”, or “yeah my proposal is about utilities over worlds”. Either way I think a lot more concreteness is needed.
Objection 2: What if the AI self-modifies to stop being corrigible? What if it builds a non-corrigible successor?
Presumably a sufficiently capable AI would self-modify to stop being corrigible because it planned to, and such a plan would certainly score very poorly on its “the humans will remain in control” assessment. So the plan would get a bad aggregate score, and the AI wouldn’t do it. Ditto with building a non-corrigible successor.
I should clarify what I thought you were claiming in the post:
From my perspective, there are 2 ways to justify corrigibility proposals:
Argue concretely why your proposal is sufficient to reach pivotal capability level while remaining safe.
Show that you are describing a coherent preference that could be superintelligently/unboundedly optimized while still remaining safe/shutdownable/correctable.
I understood you as claiming your proposal fulfills the second thing.
Your answer to Objection 2 sounds to me pretty naive:
How exactly do you aggregate goal- and helpfulness-preferences? If you weigh helpfulness heavily enough that it stays safe, does it then become useless?
Might the AI still prefer plans that make it less likely for the human to press the shutdown button? If so, doesn’t it seem likely that the AI will take other actions that don’t individually seem too unhelpful and eventually disempower humans? And if not, doesn’t that mean the AI would just act on the standard instrumental incentives of the outcome-based goal (other than avoiding shutdown), which would totally cause the operators to shut the AI down? Or how exactly is helpfulness supposed to juggle this?
And we’re not even getting started into problems like “use the shutdown button as outcome pump” as MIRI considered in their corrigibility paper. (And they considered more proposals privately. E.g. Eliezer mentions another proposal here.)
But maybe you actually were just imagining a human-level AI that behaves corrigibly? In which case I’m like “sure but it doesn’t obviously scale to pivotal level and you haven’t argued for that yet”.
ADDED: On second thought, perhaps you were thinking the approach scales to working for pivotal-level brain-like AGI. This is plausible but by no means close to obvious to me. E.g. maybe if you scale brain-like AGI smart enough, it starts working in different ways than are natural for it, e.g. using lots of external programs to do optimization. And maybe then you’re like “the helpfulness assessor wouldn’t allow running too-dangerous programs because of value-drift worries”, and then I’m like “ok fair, it seems like a fine assumption that it’s still going to be capable enough, but how exactly do you expect the helpfulness drive to also scale in capability as the AI becomes smarter? (and I also see other problems)”. Happy to try to concretize the proposal together (e.g. builder-breaker-game style).
Just hoping that you don’t get what seems to humans like weird edge instantiations seems silly if you’re dealing with actually very powerful optimization processes. (I mean if you’re annoyed with stupid reward specification proposals, perhaps try to apply that lens here?)
It assesses how well this plan pattern-matches to the concept “there will ultimately be lots of paperclips in the universe”,
It assesses how well this plan pattern-matches to the concept “the humans will remain in control”
So this seems to me like you get a utility score for the first, a utility score for the second, and you try to combine those in some way so it is both safe and capable. It seems to me quite plausible that this is how MIRI got started with corrigibility, and it doesn’t seem too different from what they wrote about on the shutdown button.
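In other words, as I read it, the proposal has roughly this shape (my paraphrase, not Steve’s formalism; both scoring functions are stand-ins for whatever learned pattern-matching the AI actually does):

```python
# My paraphrase of the shape of the proposal (not Steve's own formalism).

def score_paperclips(plan):
    # stand-in for pattern-matching to "there will ultimately be lots of paperclips"
    return plan["expected_paperclips"] / 1e6

def score_human_control(plan):
    # stand-in for pattern-matching to "the humans will remain in control"
    return plan["control_score"]

def aggregate(plan):
    # Some combination rule (weighted sum, product, min, ...) has to be chosen
    # here, and that choice is doing a lot of the work.
    return score_paperclips(plan) * score_human_control(plan)

def choose(plans):
    return max(plans, key=aggregate)
```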
I don’t think your objection that you would need to formalize pattern-matching to fuzzy time-extended concepts is reasonable. To the extent that the concepts humans use are incoherent, that is very worrying (e.g. if the helpfulness assessor is incoherent, it will in the limit probably get money pumped somehow, leaving the long-term outcomes to be determined mainly by the outcome-goal assessor). To the extent that the “the humans will remain in control” concept is coherent, the concepts are also just math, and you can try to strip down the fuzzy real-world parts by imagining toy environments that still capture the relevant essence. Which is what MIRI tried, and also what e.g. Max Harms tried with “empowerment”.
Concepts like “corrigibility” are often used somewhat inconsistently. Perhaps you’re like “we can just let the AI do the rebinding to better definitions of corrigibility”, and then I’m like “It sure sounds dangerous to me to let a sloppily corrigible AI try to figure out how to become more corrigible, which involves thinking a lot of thoughts about how the new notion of corrigibility might break, and those thoughts might also break the old version of corrigibility. But it’s plausible that there is a sufficient attractor that doesn’t break like that, so let me think more about it and possibly come back with a different problem.”. So yeah, your proposal isn’t obviously unworkable, but given that MIRI failed, it’s apparently not that easy to find a concrete coherent version of corrigibility, and if we start out with a more concrete/formal idea of corrigibility it might be a lot safer.
ADDED:
And if so, we could potentially set things up such that the AI finds things-that-pattern-match-to-that-concept to be intrinsically motivating. Again, it’s a research direction, not a concrete plan. But I talk about it more at Plan for mediocre alignment of brain-like [model-based RL] AGI.
I previously didn’t clearly disentangle this, but what I want to discuss here are the corrigibility aspects of your proposal, not the alignment aspects (which I am also interested in discussing but perhaps separately on your other post). E.g. it’s fine if you assume some way to point the AI, like MIRI assumed we can set the utility function of the AI.
Even just for the corrigibility part, I think you’re being too vague and that it’s probably quite hard to get a powerful optimizer that has the corrigibility properties you imagine even if the “pointing to pattern match helpfulness” part works. (My impression was that you sounded relatively optimistic about this in your post, and that “research direction” mainly was about the alignment aspects.)
(Also I’m not saying it’s obvious that it likely doesn’t work, but MIRI failing to find a coherent concrete description of corrigibility seems like significant evidence to me.)
I think I like the thing I wrote here:

To be more concrete, if I’m deciding between two possible courses of action, A and B, “preference over future states” would make the decision based on the state of the world after I finish the course of action—or more centrally, long after I finish the course of action. By contrast, “other kinds of preferences” would allow the decision to depend on anything, even including what happens during the course-of-action.
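As a toy illustration of that distinction (nothing here is meant as a real proposal, just pinning down what I mean):

```python
# Toy illustration only. A course of action is a list of (timestep, event) pairs.

course_A = [(1, "politely ask operator"), (2, "operator agrees"), (3, "factory built")]
course_B = [(1, "deceive operator"),      (2, "operator agrees"), (3, "factory built")]

def preference_over_future_states(course):
    # The decision depends only on how the world ends up, (long) after the
    # course of action is finished.
    return 1.0 if course[-1][1] == "factory built" else 0.0

def other_kind_of_preference(course):
    # The decision is also allowed to depend on what happens *during* the
    # course of action.
    score = 1.0 if course[-1][1] == "factory built" else 0.0
    if any(event == "deceive operator" for _, event in course):
        score -= 1.0
    return score

# The first function can't tell A and B apart; the second one can.
```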
By “world” I mean “reality” more broadly, possibly including the multiverse or whatever the agent cares about. So for example:
But I think he doesn’t think there’s anything wrong with e.g. wanting there to be diamonds in the stellar age and paperclips afterward, it just requires a (possibly disproportionally) more complex utility function.
This is still “preference purely over future states” by my definition. It’s important that timestamps during the course-of-action are not playing a big role in the decision, but it’s not important that there is one and only one future timestamp that matters. I still have consequentialist preferences (preferences purely over future states) even if I care about what the universe is like in both 3000AD and 4000AD.
If you were making any other point in that section, I didn’t understand it.
The part where you wrote “not trajectories as in “include preferences about the actions you take” kind of sense, but only about how the universe unfolds” sounds to me like you’re invoking non-indexical preferences? (= preferences that make no reference to this-agent-in-particular.) If so, I don’t see any relation between non-indexical preferences and “preferences purely about future states”—I think all four quadrants are possible, and that things like instrumental convergence depend only on “preferences purely about future states”, independent of indexicality.
It seems to me quite plausible that this is how MIRI got started with corrigibility, and it doesn’t seem too different from what they wrote about on the shutdown button.
I think it’s fundamentally different, but I already made that argument in a previous comment. Guess we have to just agree to disagree.
I don’t think your objection that you would need to formalize pattern-matching to fuzzy time-extended concepts is reasonable. To the extent that the concepts humans use are incoherent, that is very worrying (e.g. if the helpfulness assessor is incoherent, it will in the limit probably get money pumped somehow, leaving the long-term outcomes to be determined mainly by the outcome-goal assessor). To the extent that the “the humans will remain in control” concept is coherent, the concepts are also just math…
I don’t think fuzzy time-extended concepts are necessarily “incoherent”, although I’m not sure I know what you mean by that anyway. I do think it’s “just math” (isn’t everything?), but like I said before, I don’t know how to formalize it, and neither does anyone else, and if I did know then I wouldn’t publish it because of infohazards.
Show that you are describing a coherent preference that could be superintelligently/unboundedly optimized while still remaining safe/shutdownable/correctable.
I reject this way of talking, in this context. We shouldn’t use the passive voice, “preference that could be…optimized”. There is a particular agent which has the preferences and which is doing the optimization, and it’s the properties of this agent that we’re talking about. It will superintelligently optimize something if it wants to superintelligently optimize it, and not if it doesn’t, and it will do that via methods that it wants to employ, and not via methods that it doesn’t want to employ, etc.
Other than that, I think you were reading this post as a positive case that I have a plan that will work, instead of a rebuttal to an argument that this line of research is fundamentally doomed.
For example, if someone says they have a no-go argument that one cannot prove the Riemann hypothesis by (blah) type of high-level strategy, then it’s fine to rebut this argument, without knowing how to execute the high-level strategy, and while remaining open-minded that there is a different better no-go argument for the same conclusion.
I feel like the post says that a bunch of times, but I’m open to making edits if there’s any particular text that you think gives the wrong impression.
The part where you wrote “not trajectories as in “include preferences about the actions you take” kind of sense, but only about how the universe unfolds” sounds to me like you’re invoking non-indexical preferences? (= preferences that make no reference to this-agent-in-particular.)
(Not that important, but IIRC “preferences over trajectories” was formalized as “preferences over state-action sequences”, and I think it’s sorta weird to have preferences over your actions other than what kind of states they result in, so I meant without the action part. (Because an action is either an atomic label, in which case actions could be relabeled so that preferences over actions are meaningless, or it’s in some way about what happens in reality.) But it doesn’t matter much. In my way of thinking about it, the agent is part of the environment and so you can totally have preferences related to this-agent-in-particular.)
It’s important that timestamps during the course-of-action are not playing a big role in the decision, but it’s not important that there is one and only one future timestamp that matters. I still have consequentialist preferences (preferences purely over future states) even if I care about what the universe is like in both 3000AD and 4000AD.
I guess then I misunderstood what you mean by “preferences over future states/outcomes”. It’s not exactly the same as my “preferences over worlds” model because of e.g. logical decision theory stuff, but I suppose it’s close enough that we can say it’s equivalent if I understand you correctly.
But if you can care about multiple timestamps, why would you only be able to care about what happens (long) after a decision, rather than also what happens during it? I don’t understand why you think “the human remains in control” isn’t a preference over future states. It seems to me just straightforwardly a preference that the human is in control at all future timesteps.
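I.e. something with roughly this shape (my sketch):

```python
# My sketch of the point: "the humans remain in control" looks to me like just
# a function of the future states, one term per future timestep.

def control_utility(future_states, in_control):
    # in_control(state) -> bool is whatever predicate cashes out "in control"
    return min(1.0 if in_control(s) else 0.0 for s in future_states)
```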
Can you give one or more examples of what an “other kind of preference” would be? Or where you draw the line on what is not a “preference over (future) states”? I just don’t understand what you mean here then.
(One perhaps bad attempt at guessing: You think helpfulness over worlds/future-states wouldn’t weigh strongly enough in decisions, so you want a myopic/act-based helpfulness preference in each decision. (I can think about this if you confirm.))
Or maybe you just actually mean that you can have preferences about multiple timestamps but all must be in the non-close future? Though this seems to me like an obviously nonsensical position and an extreme strawman of Eliezer.
Show that you are describing a coherent preference that could be superintelligently/unboundedly optimized while still remaining safe/shutdownable/correctable.
I reject this way of talking, in this context. We shouldn’t use the passive voice, “preference that could be…optimized”. There is a particular agent which has the preferences and which is doing the optimization, and it’s the properties of this agent that we’re talking about. It will superintelligently optimize something if it wants to superintelligently optimize it, and not if it doesn’t, and it will do that via methods that it wants to employ, and not via methods that it doesn’t want to employ, etc.
From my perspective it looks like this:

If you want to do a pivotal act you need powerful consequentialist reasoning directed at a pivotal task. This kind of consequentialist cognition can be modelled as utility maximization (or quantilization or so).

If you try to keep it safe through constraints that aren’t part of the optimization target, powerful enough optimization will figure out a way around them or a way to get rid of the constraint.

So you want to try to embed the desire for helpfulness/corrigibility in the utility function itself.

If I try to imagine what a concrete utility function for your proposal might look like, e.g. “multiply the score of how well I am accomplishing my pivotal task with the score of how well the operators remain in control”, I think the utility function will have undesirable maxima. And we need to optimize on utility hard enough that the pivotal act is actually successful, which is probably hard enough to get into the undesirable zones.
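Toy numbers (all made up) to illustrate the worry with that kind of product:

```python
# Made-up numbers illustrating the worry: with a product-style utility and a
# wide enough search over plans, a plan that gives up a little "control" for a
# lot of task performance comes out on top.

def utility(task_score, control_score):
    return task_score * control_score

cautious_plan = utility(task_score=1e3, control_score=1.00)  # 1000.0
grabby_plan   = utility(task_score=1e6, control_score=0.95)  # 950000.0

print(grabby_plan > cautious_plan)  # True: the optimizer prefers the grabby plan
```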
Passive voice was meant to convey that you only need to write down a coherent utility function rather than also describing how you can actually point your AI to that utility function. (If you haven’t read the “ADDED” part which I added yesterday at the bottom of my comment, perhaps read that.)
Maybe you disagree with the utility frame?
I don’t think fuzzy time-extended concepts are necessarily “incoherent”, although I’m not sure I know what you mean by that anyway. I do think it’s “just math” (isn’t everything?), but like I said before, I don’t know how to formalize it, and neither does anyone else, and if I did know then I wouldn’t publish it because of infohazards.
If you think that part would be an infohazard, you misunderstand me. E.g. check out Max Harms’ attempt at formalizing corrigibility through empowerment. Good abstract concepts usually have simple mathematical cores, e.g.: probability, utility, fairness, force, mass, acceleration, …

I didn’t say it was easy, but that’s what I think actually useful progress on corrigibility looks like. (Without concreteness/math you may fail to realize how the preferences you want the AI to have are actually in tension with each other and quite difficult to reconcile, and then if you build the AI (and maybe push it past its reluctances so it actually becomes competent enough to do something useful), the preferences don’t get reconciled in that difficult desirable way, but somehow differently, in a way that ends up badly.)
But I talk about it more at Plan for mediocre alignment of brain-like [model-based RL] AGI. For what it’s worth, I think I’m somewhat more skeptical of this research direction now than when I wrote that 2 years ago, more on which in a (hopefully) forthcoming post.
If you have an unpublished draft, do you want to share it with me? I could then sometime the next 2 weeks read both your old post and the new one and think whether I have any more objections.