Reading this made me realize a pretty general idea, which we can call “decoupling action from utility”.
Consequentialist AI: figure out which action, if carried out, would maximize paperclips; then carry out that action.
Decoupled AI 1: figure out which action, if carried out, would maximize paperclips; then print a description of that action.
Decoupled AI 2: figure out which action, if described to a human, would be approved; then carry out that action. (Approval-directed agent)
Decoupled AI 3: figure out which prediction would be true if the output were erased (by a low-probability event) before anyone could read it; then print that prediction. (Counterfactual oracle)
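To make the contrast concrete, here is a minimal toy sketch of the consequentialist baseline and the first two decoupled designs. Everything in it (the actions, the numbers, the scoring functions) is invented for illustration; it shows where the utility calculation stops and what happens with its output, not how any of these would actually be built.

```python
# Toy illustration of "decoupling action from utility"; all values are made up.

ACTIONS = ["build a paperclip factory", "buy wire", "write a persuasive essay", "do nothing"]

def expected_paperclips(action):
    # stand-in for a consequence-predicting world model
    return {"build a paperclip factory": 1000, "buy wire": 50,
            "write a persuasive essay": 10, "do nothing": 0}[action]

def predicted_approval(action):
    # stand-in for a model of how a human would rate this action if described
    return {"build a paperclip factory": 0.3, "buy wire": 0.6,
            "write a persuasive essay": 0.2, "do nothing": 0.9}[action]

def execute(action):
    print(f"EXECUTING: {action}")

def display(text):
    print(f"PRINTING:  {text}")

def consequentialist_ai():
    best = max(ACTIONS, key=expected_paperclips)
    execute(best)                 # acts on the world directly

def decoupled_ai_1():
    best = max(ACTIONS, key=expected_paperclips)
    display(f"plan: {best}")      # same search, but the output is only a description

def decoupled_ai_2():
    best = max(ACTIONS, key=predicted_approval)
    execute(best)                 # acts, but the thing maximized is predicted approval

consequentialist_ai()
decoupled_ai_1()
decoupled_ai_2()
```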
Any other ideas for “decoupled” AIs, or risks that apply to this approach in general?
(See also the concept of “decoupled RL” from some DeepMind folks.)
I’m all for it! See my post here advocating for research in that direction. I don’t think there’s any known fundamental problem, just that we need to figure out how to do it :-)
For example, with end-to-end training, it’s hard to distinguish the desired “optimize for X, then print your plan to the screen” from the super-dangerous “optimize the probability that the human operator thinks they are looking at a plan for X”. (This is probably the kind of inner alignment problem that ofer is referring to.)
I proposed here that maybe we can make this kind of decoupled system with self-supervised learning, although there are still many open questions about that approach, including the possibility that it’s less safe than it first appears.
Incidentally, I like the idea of mixing Decoupled AI 1 and Decoupled AI 3 to get:
Decoupled AI 5: “Consider the (counterfactual) Earth with no AGIs, and figure out the most probable scenario in which a small group (achieves world peace / cures cancer / whatever), and then describe that scenario.”
I think this one would be likelier to give a reasonable, human-compatible plan on the first try (though you should still ask follow-up questions before actually doing it!).
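(In the same toy style as the sketch near the top of the thread, with invented scenarios and probabilities, this design would look roughly like the following; the only thing being illustrated is that the search conditions on an AGI-free world and merely describes what it finds.)

```python
# Toy sketch of Decoupled AI 5: condition on a counterfactual AGI-free Earth,
# find the most probable scenario in which a small group achieves the goal,
# and describe it. Scenarios and probabilities are invented for illustration.

SCENARIOS = [
    {"desc": "a decades-long public-health coalition eradicates the disease",
     "p_given_no_agi": 0.03, "achieves_goal": True},
    {"desc": "a lone genius cures it overnight",
     "p_given_no_agi": 0.0001, "achieves_goal": True},
    {"desc": "nothing much changes",
     "p_given_no_agi": 0.9, "achieves_goal": False},
]

def decoupled_ai_5():
    candidates = [s for s in SCENARIOS if s["achieves_goal"]]
    best = max(candidates, key=lambda s: s["p_given_no_agi"])
    print(f"PRINTING: {best['desc']}")

decoupled_ai_5()
```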
If the question is about all the risks that apply, rather than special risks with this specific approach, then I’ll note that the usual risks from the inner alignment problem seem to apply.
Yes, decoupling seems to address a broad class of incentive problems in safety, which includes the shutdown problem and various forms of tampering / wireheading. Other examples of decoupling include causal counterfactual agents and counterfactual reward modeling.
Another design: imitation learning. More generally, the pattern seems to be: policies that aren’t selected on the basis of maximizing some kind of return.
The class of non-agent AIs (those not choosing actions based on the predicted resulting utility) seems very broad. We could choose actions alphabetically, or use an expert system representing the outside view, or use a biased/inaccurate model when predicting consequences, or include preferences about which actions are good or bad in themselves.
I don’t think there’s any general failure mode (there are certainly specific ones), but if we condition on this AI being selected by humans, maybe we select something that’s doing enough optimization that it will take a highly-optimizing action like rewriting itself to be an agent.
Design ~~2~~ 1 may happen to reply “Convince the director to undecouple the AI design by telling him <convincing argument>.” which could convince the operator that reads it and therefore fail as ~~3~~ 2 fails.
Design ~~2~~ 1 may also model distant superintelligences that break out of the box by predictably maximizing paperclips iff we draw a runic circle that, when printed as a plan, convinces the reader or hacks the computer.
Why would such “dual purpose” plans have higher approval value than some other plan designed purely to maximize approval?
Oh, damn it, I mixed up the designs. Edited.
Can’t quite read your edit, did you mean 3?
Yeah, then I agree with both points. Sneaky!
FWIW, this reminds me of Holden Karnofsky’s formulation of Tool AI (from his 2012 post, Thoughts on the Singularity Institute):

Another way of putting this is that a “tool” has an underlying instruction set that conceptually looks like: “(1) Calculate which action A would maximize parameter P, based on existing data set D. (2) Summarize this calculation in a user-friendly manner, including what Action A is, what likely intermediate outcomes it would cause, what other actions would result in high values of P, etc.” An “agent,” by contrast, has an underlying instruction set that conceptually looks like: “(1) Calculate which action, A, would maximize parameter P, based on existing data set D. (2) Execute Action A.” In any AI where (1) is separable (by the programmers) as a distinct step, (2) can be set to the “tool” version rather than the “agent” version, and this separability is in fact present with most/all modern software. Note that in the “tool” version, neither step (1) nor step (2) (nor the combination) constitutes an instruction to maximize a parameter—to describe a program of this kind as “wanting” something is a category error, and there is no reason to expect its step (2) to be deceptive.
If I understand correctly, his “agent” is your Consequentialist AI, and his “tool” is your Decoupled AI 1.
I can’t make sense of 3. Most predictions’ truth is not contingent on whether they have been erased or not. Stipulations are. Successful action recommendations are stipulations. How does any action recommendation get through that?
You can read about counterfactual oracles in this paper. Stuart also ran a contest on LW about them.
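As a rough illustration of the setup in that paper (my own simplified toy rendering, with made-up stand-in functions, not the paper’s actual formulation): the prediction is only ever scored on the rare episodes where it is erased before anyone reads it, so its training signal comes entirely from worlds its output did not influence.

```python
import random

# Toy sketch of a counterfactual-oracle episode. All functions are stand-ins
# invented for illustration. The key structural point: the answer is scored
# only in the erasure branch, where nobody read it, so the oracle is trained
# to be accurate about the world as it unfolds without its output having any
# effect, rather than to make self-fulfilling statements.

ERASURE_PROB = 0.01

def oracle_answer(question):
    return 42                       # stand-in for the oracle's prediction

def world_outcome(answer_was_read):
    # stand-in for how the world turns out; in the erasure branch the answer
    # was never read, so it cannot have influenced the outcome
    return 41 if not answer_was_read else 7

def episode(question):
    answer = oracle_answer(question)
    if random.random() < ERASURE_PROB:
        # Erasure event: hide the answer, let the world run, then score it.
        outcome = world_outcome(answer_was_read=False)
        return (answer - outcome) ** 2    # the only training signal there is
    # Normal event: humans read the answer; no training signal is produced.
    print(f"oracle says: {answer}")
    return None

episode("how many paperclips will exist next year?")
```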
Decoupled AI 4: figure out which action would reach the goal without affecting the outside world (low-impact AI)
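(One rough way proposals in this family are often cashed out, sketched here only for concreteness: pick the action a maximizing U_goal(a) − λ · Impact(a), where Impact(a) measures how far the world after taking a diverges from some baseline such as doing nothing; choosing the impact measure and the trade-off λ is exactly the hard part.)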
I don’t think that low impact is decoupled, and it might be misleading to view it from that frame / lend a false sense of security. The policy is still very much shaped by utility, unlike with approval.
Risks: Any decoupled AI “wants” to be coupled. That is, it will converge to the solutions which will actually affect the world, as they will provide the highest expected utility.
I agree for 3, but not for 2.