Thank you for engaging. If this was unclear for you, then I’m sure it was also unclear for others.
The post outlined a scenario where a Corrigibility method works perfectly for one type of AI (an AI that does not imply an Identifiable Outcome (IO), for example a PAAI). The same Corrigibility method fails completely for another type of AI (an AI that does imply an IO, for example PCEV). So the second AI, which does have an IO, is indeed not corrigible.
This Corrigibility method leads to an outcome that is massively worse than extinction. This bad outcome is the result of two things being true: (i) the fully Corrigible first AI made this outcome possible to reach, and (ii) since the Corrigibility method worked perfectly for the first AI, the designers falsely believed that the Corrigibility method would also work for the second AI.
The second AI wants many things. It wants to get an outcome as close as possible to IO. The Corrigibility method resulted in the second AI also wanting many additional things (such as wanting all explanations it gives to count as Acceptable Explanations (AE), even if this makes the explanations less efficient, and wanting to avoid implementing anything unless informed designers want that thing to be implemented). But in practice the Corrigibility method does not change the outcome in any way (it just adds an "explanation step"). So I think it makes sense to say that the second AI has "zero Corrigibility". The first AI is completely corrigible. And if the designers had only used the Corrigibility method for the first AI, then the Corrigibility method would have worked perfectly.
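To make the "zero Corrigibility" claim a bit more concrete, here is a minimal toy sketch (my own illustration, not something from the post; all names such as `choose_outcome` and `counts_as_ae` are made up). The point is just that if every outcome the second AI would consider can be wrapped in an AE-compliant explanation step, then adding the AE constraint does not change which outcome gets chosen.

```python
# Toy sketch (hypothetical names): the second AI picks the AE-compliant
# outcome closest to IO. If the AE check never rules anything out, the
# chosen outcome is identical to the unconstrained choice.

def choose_outcome(candidates, distance_to_io, counts_as_ae):
    """Pick the AE-compliant candidate that is closest to IO."""
    compliant = [c for c in candidates if counts_as_ae(c)]
    return min(compliant, key=distance_to_io)

# Made-up candidates, labelled only by their distance to IO.
candidates = ["exactly_IO", "near_IO", "far_from_IO"]
distance = {"exactly_IO": 0, "near_IO": 1, "far_from_IO": 9}.get

# An AE definition that every candidate satisfies: it filters nothing.
always_ae = lambda c: True

unconstrained = min(candidates, key=distance)
constrained = choose_outcome(candidates, distance, always_ae)
assert constrained == unconstrained == "exactly_IO"  # the AE constraint changed nothing
```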
This is what I was trying to communicate with the first sentence of the post: "A Corrigibility method that works for a Pivotal Act AI (PAAI) but fails for a CEV style AI could make things worse." I could have used that sentence as a title, but I decided against trying to include everything in the title. (I think it is ok to leave information out of the title, as long as the title is accurate. And the title is, in fact, accurate: the first AI is genuinely corrigible, and things would have turned out a lot better in the counterfactual situation where no one had developed any form of Corrigibility.)
One possible source of confusion could be that you are interpreting this post as referring to some specific example scenario from your sequence. My post was mostly written before you posted your sequence. It is not meant as a comment on any specific AI in your sequence (which is why I don't link to it). But given that you had just published your sequence, maybe you were expecting my scenario to contain a single, partially corrigible AI (without an IO). That is not the scenario that I was describing in my post.
However, I could actually make the same point using a scenario with a single AI (without an IO) that is partially Corrigible. (There is a more general danger here that is not strongly related to the number of AI designs involved.) So here is an attempt to make the same point using such a scenario instead. A possible title for such an alternative post would be: "A partially corrigible AI could make things worse". (This is also a standalone scenario, and it is not meant as a response to anything specific in your sequence.)
I think that one could reasonably describe Corrigibility as being context-dependent. A given AI could be fully Corrigible in one context (such as preventing competing AI projects), and not Corrigible at all in another context (such as discussing Alignment Target Analysis). I think that one could reasonably refer to such an AI as being partially Corrigible. And, as will be shown below, such an AI could lead to an outcome that is massively worse than extinction.
Summary: Consider the case where a design team uses a Corrigibility method to build an AI Assistant (AIA). The resulting AIA does not have an IO. When the design team tries to use the AIA to prevent competing AI projects, everything works perfectly. However, when they try to use the AIA to understand Alignment Target Analysis (ATA), the Corrigibility method fails completely. Let's try two very rough analogies. Talking to the AIA about shutting down competing AI projects is very roughly analogous to using a djinn that grants wishes while caring fully about intentions in exactly the right way. But talking to the AIA about ATA is very roughly analogous to using a djinn that grants wishes while not caring about intentions at all (so not a djinn with any form of preferred outcome, and not any form of "malicious story djinn", but also not a safe djinn).
The AIA always wants to interact in ways that count as Acceptable Explanations (AE). When the designers ask it to shut down all hardware that is capable of running a powerful AI, it disobeys the order and explains that human brains can, in theory, be used to run a powerful AI. When they ask it to shut down all non-biological hardware that is capable of running a powerful AI, it first asks if it should delete itself, or if it should transfer itself to biological hardware. Etc. In short: it is Corrigible in this context. In particular: while talking about this topic, the definition of AE holds up.
When one of the designers asks the AIA to explain PCEV, however, the definition of AE does not hold up. When the AIA was discussing the potential action of shutting down hardware, there were two components that it wanted the designers to understand: (i) the reasons for shutting down this hardware, and (ii) the effects of shutting down this hardware. In the hardware case, the designers already understood (i) reasonably well, so there was no need to explain it. When the AIA is asked about PCEV, there are again two components that the AIA wants the designers to understand: (i) the reasons for building PCEV, and (ii) the effects of building PCEV. PCEV is a two-component thing: it is simultaneously an alignment target and a normative moral theory. It turns out that in this case, the designers do not actually understand (i) at all. They do not understand the normative moral theory behind PCEV. So the AIA explains this normative moral theory to the designers. If the AE definition had been perfect, this would not have been a problem. In other words: if the AIA had been completely Corrigible, then this would not have been a problem.
But it turns out that the people who designed the Corrigibility method did not in fact have a sufficiently good understanding of concepts along the lines of "normative moral theories", "explanations", "understanding", etc. (Understanding these concepts sufficiently well was a realistic outcome, but in this scenario the designers failed to do so.) As a result, the AE definition is not perfect, and the AIA is only partly Corrigible. So the AIA "explains" the "normative moral theory of PCEV" until the designers "understand" it (using an explanation that counts as AE). This results in designers who feel a moral obligation to implement PCEV, regardless of what the result is. This new moral framework is robust to learning what happened. So the result is a set of fully informed designers who are fully committed to implementing PCEV. So the outcome is massively worse than extinction. (One way to reduce the probability of scenarios along these lines is to make progress on ATA. Thus, even the existence of an AI that seems to be completely corrigible is not an argument against the need to make progress on ATA. This also works as a way of gesturing at the more general point that I was trying to gesture at in the first post.)
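As a rough way of illustrating this context-dependence, here is another toy sketch (again my own illustration, with made-up predicate and label names, not anything from the post): the same imperfect AE definition does real work in the hardware-shutdown context, but fails to rule out the catastrophic explanation in the PCEV context.

```python
# Toy sketch (hypothetical names): one AE check, two contexts.
# In one context it acts like full Corrigibility; in the other it
# accepts an explanation that instills the normative theory.

def counts_as_ae(context, explanation):
    """Stand-in for the designers' imperfect AE definition."""
    if context == "shut_down_hardware":
        # Here the definition does real work: the explanation must cover
        # the effects the designers did not already understand.
        return "effects_of_shutdown" in explanation
    if context == "explain_PCEV":
        # Here the definition fails to rule out the dangerous case: it
        # accepts any explanation that makes the designers feel they
        # "understand", including one that instills the moral framework.
        return "designers_feel_they_understand" in explanation
    return False

# In the first context the check behaves like full Corrigibility...
assert counts_as_ae("shut_down_hardware", {"effects_of_shutdown"})
# ...while in the second it accepts the catastrophic explanation too.
assert counts_as_ae("explain_PCEV", {"designers_feel_they_understand",
                                     "instills_obligation_to_build_PCEV"})
```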
(Wei Dai has been talking about related things for quite a while)
From my point of view, you are making an important point that I agree with: corrigibility isn't uniformly safe for all use cases; it must be used carefully, and only in the use cases it is safe for. I've discussed this point with Max a bunch. The key aspect of corrigibility is keeping the operator empowered, and it is thus necessarily unsafe in the hands of foolish or malicious operators.
Examples of good use:
further AI alignment research
monitoring the web for rogue AGI
operating and optimizing a factory production line
medical research
helping with mundane aspects of government action, like smoothing out a part of a specific bureaucratic process that needs well-described, bounded decision-making (e.g. being a DMV assistant, or a tax-evasion investigator who takes no action other than filing reports on suspected misbehavior)
Examples of bad use:
asking the AI to convince you of something, or even just to explain a concept persistently until it's sure you understand
trying to do a dramatic, irreversible, highly world-affecting act, such as a pivotal act
trying to implement a value-aligned agent, a PCEV agent, or the like. In fact, trying to create any agent which isn't just an exact copy of the known-safe current corrigible agent.
trying to research and create particularly dangerous technology, such as self-replicating tech that might get out of hand (e.g. synthetic biology, bioweapons). This is a case where the AI succeeding safely at the task is itself a dangerous result! Now you’ve got a potential Bostrom-esque ‘black ball’ technology in hand, even though the AI didn’t malfunction in any way.
Thanks! I now feel unconfused. To briefly echo back the key idea which I heard (and also agree with): a technique which can create a corrigible PAAI might have assumptions which break if that technique is used to make a different kind of AI (i.e. one aimed at CEV). If we call this technique “the Corrigibility method” then we may end up using the Corrigibility method to make AIs that aren’t at all corrigible, but merely seem corrigible, resulting in disaster.
This is a useful insight! Thanks for clarifying. :)