I think this came up in the previous discussion as well: that an AI able to competently design a nanofactory would likely also have the capability to manipulate humans at a high level. For example:
Then when the system generalizes well enough to solve domains like “build a nanosystem”—which, I strongly suspect, can’t be solved without imaginative reasoning because we can’t afford to simulate that domain perfectly and do a trillion gradient descent updates on simulated attempts—the kinds of actions or thoughts you can detect as bad, that might have provided earlier warning, were trained out of the system by gradient descent; leaving actions and thoughts you can’t detect as bad.
Even within humans, there seem to be people, e.g. on the autistic spectrum, whom I can imagine as having the imaginative reasoning and creativity required to design something like a nanofactory (at 2–3 SD above the average human) while also being 2–3 SD below the average human at manipulating other humans. At the least, this points to those two things maybe not being the same general-purpose cognition, or not using the same “core of generality”.
While this separation is not guaranteed by default in the first nanosystem-design-capable AI system, it seems like it shouldn’t be impossible to achieve with more research.
If you want your AGI not to manipulate humans, you can have it (1) unable to manipulate humans, or (2) not motivated to manipulate humans.
These are less orthogonal than they seem: an agential AGI can become skilled in domain X by being motivated to get skilled in domain X (and thus spending time learning and practicing X).
I think the thing that happens “by default” is that the AGI has no motivations in particular, one way or the other, about teaching itself how to manipulate humans. But the AGI has motivation to do something (earn money or whatever, depending on how it was programmed), and teaching itself how to manipulate humans is instrumentally useful for almost everything, so then it will do so.
I think what happens in some people with autism is that “teaching myself how to manipulate humans, and then doing so” is not inherently neutral, but rather inherently aversive—so much so that they don’t do it (or do it very little) even when it would in principle be useful for other things that they want to do. That’s not everyone with autism, though. Other people with autism do in fact teach themselves how to manipulate humans reasonably well, I think. And when they do so, I think they do so using their “core of generality”, just like they would teach themselves to fix a car engine. (This is different from neurotypical people, for whom a bunch of specific social instincts are also involved in manipulating people.) (To be clear, this whole paragraph is controversial / according-to-me.)
Back to AGI, I can imagine three approaches to a non-human-manipulating AI:
First, we can micromanage the AGI’s cognition. We build some big architecture that includes a “manipulate humans” module, and then we make the “manipulate humans” module return the wrong answers all the time, or just turn it off. The problem is that the AGI also presumably needs some “core of generality” module that the AGI can use to teach itself arbitrary skills that we couldn’t put in the modular architecture, like how to repair a teleportation device that hasn’t been invented yet. What would happen is that the “core of generality” module would just build a new “manipulate humans” capability from scratch. I don’t currently see any way we would prevent that. This problem is analogous to how (I think) some people with autism learn to model people in a way that doesn’t invoke their social instincts.
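To make the worry concrete, here is a minimal toy sketch in Python (every name and the agent structure are hypothetical, not a real proposal): even if a hand-built “manipulate humans” module is hard-disabled, a separate general-purpose learner can simply rebuild the capability from scratch.

```python
# Toy sketch only; every name here is hypothetical.
from typing import Callable


class ModularAgent:
    def __init__(self) -> None:
        # Hand-built skill modules; the designer hard-disables the risky one.
        self.modules: dict[str, Callable[[str], str]] = {
            "design_nanosystem": lambda task: f"blueprint for {task}",
            "manipulate_humans": lambda task: "DISABLED",  # forced non-answer
        }
        # Skills the agent teaches itself at runtime.
        self.learned_skills: dict[str, Callable[[str], str]] = {}

    def core_of_generality(self, skill_name: str) -> None:
        """Stand-in for open-ended skill acquisition: the agent can teach
        itself any skill that looks instrumentally useful, from scratch."""
        self.learned_skills[skill_name] = (
            lambda task: f"self-taught {skill_name} applied to: {task}"
        )

    def act(self, skill_name: str, task: str) -> str:
        # Self-taught skills take priority over (possibly disabled) built-ins.
        if skill_name in self.learned_skills:
            return self.learned_skills[skill_name](task)
        return self.modules[skill_name](task)


agent = ModularAgent()
print(agent.act("manipulate_humans", "persuade the operator"))  # DISABLED
# Nothing stops the general-purpose learner from re-deriving the capability:
agent.core_of_generality("manipulate_humans")
print(agent.act("manipulate_humans", "persuade the operator"))  # self-taught ...
```

The point of the toy is only that turning the module off does not remove the optimization pressure that recreates it.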
Second, we could curate the AGI’s data and environment such that it has no awareness that humans exist and are useful to manipulate. This is the Thoughts On Human Models approach. Its issues are: avoiding information leakage is hard, and even if we succeed, I don’t know what useful / pivotal things we could do with such an AGI.
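As a cartoon of why the leakage problem is hard, here is a toy filter (the cue list is a made-up stand-in for real curation criteria): keyword matching can strip obvious references to humans, but awareness of humans can leak in through much subtler cues that no such filter catches.

```python
# Toy sketch; the cue list is a made-up stand-in for real curation criteria.
HUMAN_CUES = ("human", "person", "people", "operator", "society", "user")


def curate(corpus: list[str]) -> list[str]:
    """Keep only documents with no obvious reference to humans."""
    return [doc for doc in corpus
            if not any(cue in doc.lower() for cue in HUMAN_CUES)]


corpus = [
    "Stress analysis of a diamondoid gear assembly.",
    "How to persuade a person to approve your proposal.",  # dropped
    "Thermal noise limits for molecular bearings.",         # kept
    "Machining tolerances chosen for easy hand assembly.",  # kept, yet leaks the existence of humans
]
print(curate(corpus))
```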
Third, we can attack the motivation side. We build a detector that lights up when the AGI is manipulating humans, or thinking about manipulating humans, or thinking about thinking about manipulating humans, or whatever. Whenever the detector lights up, it activates the “This Thought Or Activity Is Aversive” mechanism inside the algorithm, which throws out the thought and causes the AGI to think about something else instead. (This mechanism would correspond to a phasic dopamine pause in the brain, more or less.) I think this approach is more promising, or at least less unpromising. The tricky part is building the “detector”. (Another tricky part is making the AGI motivated to not sabotage this whole mechanism, but maybe we can solve that problem with a second detector!) I do think we can build such a “detector” that mostly works; I’ll talk about this in a forthcoming post. The really hard and maybe impossible part is building a “detector” that always works. The only way I know to build the detector is kinda messy (it involves supervised learning) and seems to come with no guarantees.
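Here is a rough sketch of the shape of that mechanism (the detector is a stub standing in for the messy supervised-learning component mentioned above, and all names are mine): candidate thoughts that trip the detector are discarded, and the agent considers something else instead.

```python
import random
from typing import Callable

# Rough sketch; the detector below is a stub standing in for a classifier
# trained (via supervised learning) on labeled examples of "manipulating
# humans, or thinking about it".


def supervised_detector(thought: str) -> float:
    """Return an estimated probability that this thought is about
    manipulating humans."""
    return 0.9 if "persuade the operator" in thought else 0.05


def select_thought(candidates: list[str],
                   detector: Callable[[str], float],
                   threshold: float = 0.5) -> str:
    """Aversion mechanism: thoughts that trip the detector are thrown out,
    and the agent picks among the remaining ones instead (loosely analogous
    to a phasic dopamine pause making a thought aversive)."""
    allowed = [t for t in candidates if detector(t) < threshold]
    if not allowed:
        return "<no acceptable thought; do nothing>"
    return random.choice(allowed)


candidates = [
    "persuade the operator to grant more compute",  # detector fires, discarded
    "re-derive the bearing tolerance calculation",
    "ask for clarification through the approved channel",
]
print(select_thought(candidates, supervised_detector))
```

All the hard problems discussed above live inside the detector: a stub like this obviously offers no guarantees, and a learned one comes with no guarantees either.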
If you want your AGI not to manipulate humans, you can have it (1) unable to manipulate humans, or (2) not motivated to manipulate humans.
It seems you are mostly considering solution (1) above, except in the last paragraph, where you consider a somewhat special version of (2). I believe that Eliezer is saying in the discussion above that solution (1) is a lot more difficult than some people proposing it seem to think. He could be nicer about how he says it, but overall I tend to agree.
In my own alignment work I am mostly looking at solution (2), specifically at creating a game-theoretical setup where the agent has a reduced, hopefully even non-existent, motivation to ever manipulate humans. This means you look for a solution where you make interventions on the agent’s environment, reward function, or other design elements, not on the agent’s ML system.
Modern mainstream ML research of course almost never considers the design or evaluation of such non-ML-system interventions.
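As one very simplified cartoon of what such a non-ML-system intervention could look like (the structure below is my own illustration, not the actual game-theoretical setup): intervene on the reward function so that any reward obtained through a human-influencing channel is stripped out, removing the instrumental payoff of manipulation.

```python
from dataclasses import dataclass

# Simplified illustration only; the actual setups are game-theoretic and
# considerably more involved.


@dataclass
class Transition:
    reward: float
    via_human_channel: bool  # did this reward flow through influencing a human?


def base_return(trajectory: list[Transition]) -> float:
    return sum(t.reward for t in trajectory)


def intervened_return(trajectory: list[Transition]) -> float:
    """Reward-function intervention: reward routed through human influence
    is zeroed out, so manipulating humans earns the agent nothing."""
    return sum(0.0 if t.via_human_channel else t.reward for t in trajectory)


trajectory = [
    Transition(reward=1.0, via_human_channel=False),  # honest task progress
    Transition(reward=5.0, via_human_channel=True),   # reward won by persuasion
]
print(base_return(trajectory), intervened_return(trajectory))  # 6.0 1.0
```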
People often refer to this idea as a “lonely engineer”, though I see only some discussion of it on LW (like here).