Corrigibility could make things worse

Summary: A Corrigibility method that works for a Pivotal Act AI (PAAI) but fails for a CEV style AI could make things worse. Any implemented Corrigibility method will necessarily be built on top of a set of unexamined implicit assumptions. One of those assumptions could be true for a PAAI, but false for a CEV style AI. The present post outlines one specific scenario where this happens. This scenario involves a Corrigibility method that only works for an AI design, if that design does not imply an identifiable outcome. The method fails when it is applied to an AI design, that does imply an identifiable outcome. When such an outcome does exist, the ``corrigible″ AI will ``explain″ this implied outcome, in a way that makes the designers want to implement that outcome.


The example scenario:

Consider a scenario where a design team has access to a Corrigibility method that works for a PAAI design. A PAAI can have a large impact on the world. For example by helping a design team prevent other AI projects. But there exists no specific outcome, that is implied by a PAAI design. Since there exists no implied outcome for a PAAI to ``explain″ to the designers, this Corrigibility method actually renders a PAAI genuinely corrigible. For some AI designs, the set of assumptions that the design is built on top of, does however imply a specific outcome. Let’s refer to this as the Implied Outcome (IO). This IO can alternatively be viewed as: ``the outcome that a Last Judge would either approve of, or reject″. In other words: consider the Last Judge proposal from the CEV arbital page. If it would make sense to add a Last Judge of this type, to a given AI design, then that AI design has an IO. The IO is the outcome that a Last Judge would either approve of, or reject (for example a successor AI that will either get a thumbs up or a thumbs down). In yet other words: the purpose of adding a Last Judge to an AI design, is to allow someone to render a binary judgment on some outcome. For the rest of this post, that outcome will be referred to as the IO of the AI design in question.

In this scenario, the designers first implement a PAAI that buys time (for example by uploading the design team). For the next step, they have a favoured AI design, that does have an IO. One of the reasons that they are trying to make this new AI corrigible, is that they can’t calculate this IO. And they are not certain that they actually want this IO to be implemented.

Their Corrigibility method always results in an AI that wants to refer back to the designers, before implementing anything. The AI will help a group of designers implement a specific outcome, iff they are all fully informed, and they are all in complete agreement that this outcome should be implemented. The Corrigibility method has a definition of Unacceptable Influence (UI). And the Corrigibility method results in an AI that genuinely wants to avoid exerting any UI. It is however important that the AI is able to communicate with the designers in some way. So the Corrigibility method also includes a definition of Acceptable Explanation (AE).

At some point the AI becomes clever enough to figure out the details of the IO. At that point, it is clever enough to convince the designers that this IO is the objectively correct thing to do, using only methods classified as AE. This ``explanation″ is very effective and results in a very robust conviction, that the IO is the objectively correct thing to do. In particular, this value judgment does not change, when the AI tells the designers what has happened. So, when the AI explains what has happened, the designers do not change their mind about IO. They still consider themselves to have a duty to implement IO. The result is a situation where fully informed designers are fully committed to implementing IO. So the ``corrigible″ AI helps them implement IO.

Basically: when this Corrigibility method is applied to an AI with an IO, then this IO will end up getting implemented. The Corrigibility method works perfectly for any PAAI type AI. But for any AI with an identifiable end goal, the Corrigibility method does not change the outcome (it just adds an ``explanation″ step).

The most recently published version of CEV is Parliamentarian CEV (PCEV). A previous post showed that a successfully implemented PCEV would be massively worse than extinction. Thus, a method that makes a PAAI genuinely Corrigible, could make things worse. It could for example change the outcome from extinction, to something massively worse (by resulting in a bad IO getting implemented. For example along the lines of the IO of PCEV).


A more general danger:

There exists a more general danger, that is not strongly related to the specifics of the ``Explanation versus Influence″ definitional issues, or the ``AI designs with an IO, versus AI designs without an IO″ dichotomy, or the PAAI concept, or the PCEV proposal. Consider the more general case where a design team is relying on a two step process, where some type of ``buying time AI″ is followed by a ``real AI″. In this case, the most serious problem is probably not those assumptions that are analysed beforehand, and that are kept in mind when applying some Corrigibility method to a novel type of AI. The most serious problem is probably the set of unexamined implicit assumptions, that the designers are not aware of. Any Corrigibility method implemented by humans, will be built on top of many such assumptions. And it would in general not be particularly surprising to discover that one of these assumptions happens to be correct for one AI design, but incorrect for another AI design. It seems very unlikely that all of these implicit assumptions are humanly findable, even in principle. This means that even if a Corrigibility method works perfectly for a ``buying time AI″, it will probably never be possible to know whether or not it will actually work for a ``real AI″.

Given that PCEV has already been shown to be massively worse than extinction, it seems unlikely that the IO of PCEV will end up getting implemented. That specific danger has probably been mostly removed. But the field of Alignment Target Analysis is still at a very, very early stage. And PCEV is far from the only dangerous alignment target. In general, the field is very, very far from adequately mitigating the full set of dangers, that are related to someone successfully hitting a bad alignment target (as a tangent, it might make sense to note that a Corrigibility method that stops working at the wrong time, is just one specific path amongst many, along which a bad alignment target could end up getting successfully implemented).

Besides being at a very early stage of development, this field of research is also very neglected. At the moment there does not appear to exist any serious research effort dedicated to this risk mitigation strategy. The present post seeks to reduce this neglect, by showing that one can not rely on Corrigibility, for protection against scenarios where someone successfully hits a bad alignment target (even if we assume that Corrigibility has been successfully implemented in a PAAI).


Assumptions and limitations:

PCEV spent many years as the state of the art alignment target, without anyone noticing that a successfully implemented PCEV would have been massively worse than extinction. There exists many paths along which PCEV could have ended up getting successfully implemented. Thus, absent a solid counterargument, the dangers from successfully hitting a bad alignment target should be seen as serious by default. In other words: after the PCEV incident, the burden of proof is on anyone who would claim, that Alignment Target Analysis is not urgently needed to mitigate a serious risk. A proof of concept that such mitigation is feasible, is that the dangers associated with PCEV was reduced by Alignment Target Analysis. In yet other words: absent a solid counterargument, scenarios where someone successfully hits a bad alignment target, should be treated as a danger that is both serious and possible to mitigate. One way to construct such a counterargument, would be to base it on Corrigibility. For such a counterargument to work, Corrigibility must be feasible. Since Corrigibility must be feasible for such a counterargument to work, the present post could simply assume feasibility, when showing that such a counterargument fails (if Corrigibility is not feasible, then Corrigibility based counterarguments fail due to this lack of feasibility). So, this post simply assumed that Corrigibility is feasible.

Since the present post assumed feasibility, it did not demonstrate the existence of a serious real world danger, from partially successful Corrigibility methods (if Corrigibility is not feasible, then scenarios along these lines do not actually constitute a real problem. And feasibility was assumed). This post instead simply showed that the Corrigibility concept does not remove the urgent need for Alignment Target Analysis (a previous post showed that dangers from scenarios where someone successfully hits a bad alignment target are both very serious, and also possible to mitigate. Thus, the present post is focusing on showing why one specific class of counterarguments fail. Previous posts have addressed counterarguments based on proposals along the lines of a PAAI, and proposals along the lines of a Last Judge).

It finally makes sense to explicitly note, that if Corrigibility turns out to be feasible, then Corrigibility might have a large, net positive, safety impact. Because the danger illustrated in this post might be smaller than the safety benefits of the Corrigibility concept. (conditioned on feasibility I would tentatively guess that making progress on Corrigibility probably results in a significant net reduction in the probability of a worse-than-extinction outcome)