Reading this made me realize a pretty general idea, which we can call “decoupling action from utility”.
Consequentialist AI: figure out which action, if carried out, would maximize paperclips; then carry out that action.
Decoupled AI 1: figure out which action, if carried out, would maximize paperclips; then print a description of that action.
Decoupled AI 2: figure out which action, if described to a human, would be approved; then carry out that action. (Approval-directed agent)
Decoupled AI 3: figure out which prediction would be true if the output were erased (by a low-probability event) before anyone could read it; then print that prediction. (Counterfactual oracle)
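To make the contrast concrete, here is a minimal toy sketch of the consequentialist baseline and the first two decoupled designs. Everything in it (the actions, the numbers, the scoring functions) is invented for illustration; it shows where the utility calculation stops and what happens with its output, not how any of these would actually be built.

```python
# Toy illustration of "decoupling action from utility"; all values are made up.

ACTIONS = ["build a paperclip factory", "buy wire", "write a persuasive essay", "do nothing"]

def expected_paperclips(action):
    # stand-in for a consequence-predicting world model
    return {"build a paperclip factory": 1000, "buy wire": 50,
            "write a persuasive essay": 10, "do nothing": 0}[action]

def predicted_approval(action):
    # stand-in for a model of how a human would rate this action if described
    return {"build a paperclip factory": 0.3, "buy wire": 0.6,
            "write a persuasive essay": 0.2, "do nothing": 0.9}[action]

def execute(action):
    print(f"EXECUTING: {action}")

def display(text):
    print(f"PRINTING:  {text}")

def consequentialist_ai():
    best = max(ACTIONS, key=expected_paperclips)
    execute(best)                 # acts on the world directly

def decoupled_ai_1():
    best = max(ACTIONS, key=expected_paperclips)
    display(f"plan: {best}")      # same search, but the output is only a description

def decoupled_ai_2():
    best = max(ACTIONS, key=predicted_approval)
    execute(best)                 # acts, but the thing maximized is predicted approval

consequentialist_ai()
decoupled_ai_1()
decoupled_ai_2()
```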
Any other ideas for “decoupled” AIs, or risks that apply to this approach in general?
(See also the concept of “decoupled RL” from some DeepMind folks.)
I’m all for it! See my post here advocating for research in that direction. I don’t think there’s any known fundamental problem, just that we need to figure out how to do it :-)
For example, with end-to-end training, it’s hard to distinguish the desired “optimize for X, then print your plan to the screen” from the super-dangerous “optimize the probability that the human operator thinks they are looking at a plan for X”. (This is probably the kind of inner alignment problem that ofer is referring to.)
I proposed here that maybe we can make this kind of decoupled system with self-supervised learning, although there are still many open questions about that approach, including the possibility that it’s less safe than it first appears.
Incidentally, I like the idea of mixing Decoupled AI 1 and Decoupled AI 3 to get:
Decoupled AI 5: “Consider the (counterfactual) Earth with no AGIs, and figure out the most probable scenario in which a small group (achieves world peace / cures cancer / whatever), and then describe that scenario.”
I think this one would be likelier to give a reasonable, human-compatible plan on the first try (though you should still ask follow-up questions before actually doing it!).
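(In the same toy style as the sketch near the top of the thread, with invented scenarios and probabilities, this design would look roughly like the following; the only thing being illustrated is that the search conditions on an AGI-free world and merely describes what it finds.)

```python
# Toy sketch of Decoupled AI 5: condition on a counterfactual AGI-free Earth,
# find the most probable scenario in which a small group achieves the goal,
# and describe it. Scenarios and probabilities are invented for illustration.

SCENARIOS = [
    {"desc": "a decades-long public-health coalition eradicates the disease",
     "p_given_no_agi": 0.03, "achieves_goal": True},
    {"desc": "a lone genius cures it overnight",
     "p_given_no_agi": 0.0001, "achieves_goal": True},
    {"desc": "nothing much changes",
     "p_given_no_agi": 0.9, "achieves_goal": False},
]

def decoupled_ai_5():
    candidates = [s for s in SCENARIOS if s["achieves_goal"]]
    best = max(candidates, key=lambda s: s["p_given_no_agi"])
    print(f"PRINTING: {best['desc']}")

decoupled_ai_5()
```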
If the question is about all the risks that apply, rather than special risks with this specific approach, then I’ll note that the usual risks from the inner alignment problem seem to apply.
Yes, decoupling seems to address a broad class of incentive problems in safety, which includes the shutdown problem and various forms of tampering / wireheading. Other examples of decoupling include causal counterfactual agents and counterfactual reward modeling.
Another design: imitation learning. More generally, the pattern seems to be: policies that aren’t selected on the basis of maximizing some kind of return.
The class of non-agent AIs (those not choosing actions based on the predicted resulting utility) seems very broad. We could choose actions alphabetically, or use an expert system representing the outside view, or use a biased/inaccurate model when predicting consequences, or include preferences about which actions are good or bad in themselves.
I don’t think there’s any general failure mode (there are certainly specific ones), but if we condition on this AI being selected by humans, maybe we select something that’s doing enough optimization that it will take a highly-optimizing action like rewriting itself to be an agent.
Design ~~2~~ 1 may happen to reply “Convince the director to undecouple the AI design by telling him <convincing argument>.” which could convince the operator that reads it and therefore fail as ~~3~~ 2 fails.
Design ~~2~~ 1 may also model distant superintelligences that break out of the box by predictably maximizing paperclips iff we draw a runic circle that, when printed as a plan, convinces the reader or hacks the computer.
Why would such “dual purpose” plans have higher approval value than some other plan designed purely to maximize approval?
Oh, damn it, I mixed up the designs. Edited.
Can’t quite read your edit, did you mean 3?
Yeah, then I agree with both points. Sneaky!
FWIW, this reminds me of Holden Karnofsky’s formulation of Tool AI (from his 2012 post, Thoughts on the Singularity Institute):

Another way of putting this is that a “tool” has an underlying instruction set that conceptually looks like: “(1) Calculate which action A would maximize parameter P, based on existing data set D. (2) Summarize this calculation in a user-friendly manner, including what Action A is, what likely intermediate outcomes it would cause, what other actions would result in high values of P, etc.” An “agent,” by contrast, has an underlying instruction set that conceptually looks like: “(1) Calculate which action, A, would maximize parameter P, based on existing data set D. (2) Execute Action A.” In any AI where (1) is separable (by the programmers) as a distinct step, (2) can be set to the “tool” version rather than the “agent” version, and this separability is in fact present with most/all modern software. Note that in the “tool” version, neither step (1) nor step (2) (nor the combination) constitutes an instruction to maximize a parameter—to describe a program of this kind as “wanting” something is a category error, and there is no reason to expect its step (2) to be deceptive.
If I understand correctly, his “agent” is your Consequentialist AI, and his “tool” is your Decoupled AI 1.
I can’t make sense of 3. Most predictions’ truth is not contingent on whether they have been erased or not. Stipulations are. Successful action recommendations are stipulations. How does any action recommendation get through that?
You can read about counterfactual oracles in this paper. Stuart also ran a contest on LW about them.
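As a rough illustration of the setup in that paper (my own simplified toy rendering, with made-up stand-in functions, not the paper’s actual formulation): the prediction is only ever scored on the rare episodes where it is erased before anyone reads it, so its training signal comes entirely from worlds its output did not influence.

```python
import random

# Toy sketch of a counterfactual-oracle episode. All functions are stand-ins
# invented for illustration. The key structural point: the answer is scored
# only in the erasure branch, where nobody read it, so the oracle is trained
# to be accurate about the world as it unfolds without its output having any
# effect, rather than to make self-fulfilling statements.

ERASURE_PROB = 0.01

def oracle_answer(question):
    return 42                       # stand-in for the oracle's prediction

def world_outcome(answer_was_read):
    # stand-in for how the world turns out; in the erasure branch the answer
    # was never read, so it cannot have influenced the outcome
    return 41 if not answer_was_read else 7

def episode(question):
    answer = oracle_answer(question)
    if random.random() < ERASURE_PROB:
        # Erasure event: hide the answer, let the world run, then score it.
        outcome = world_outcome(answer_was_read=False)
        return (answer - outcome) ** 2    # the only training signal there is
    # Normal event: humans read the answer; no training signal is produced.
    print(f"oracle says: {answer}")
    return None

episode("how many paperclips will exist next year?")
```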
Decoupled AI 4: figure out which action would reach the goal without affecting the outside world (low-impact AI)
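(One rough way proposals in this family are often cashed out, sketched here only for concreteness: pick the action a maximizing U_goal(a) − λ · Impact(a), where Impact(a) measures how far the world after taking a diverges from some baseline such as doing nothing; choosing the impact measure and the trade-off λ is exactly the hard part.)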
I don’t think that low impact is decoupled, and it might be misleading to view it from that frame / lend a false sense of security. The policy is still very much shaped by utility, unlike with approval.
Risks: Any decoupled AI “wants” to be coupled. That is, it will converge to the solutions which will actually affect the world, as they will provide the highest expected utility.
I agree for 3, but not for 2.