(Eli’s personal notes, mostly for his own understanding. Feel free to respond if you want.)
If you have a big aggregate of agents that understands something the little local agent doesn’t understand, the big aggregate doesn’t inherit alignment from the little agents. Searle’s Chinese Room can understand Chinese even if the person inside it doesn’t understand Chinese, and this correspondingly implies, by default, that the person inside the Chinese Room is powerless to express their own taste in restaurant orders.
vs.
The argument for alignment isn’t that “a system made of aligned neurons is aligned.” Unalignment isn’t a thing that magically happens; it’s the result of specific optimization pressures in the system that create trouble. My goal is to (a) first construct weaker agents who aren’t internally doing problematic optimization, (b) put them together in a way that improves capability without doing other problematic optimization, (c) iterate that process.
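(Side note to myself: here is a minimal toy sketch of the (a)/(b)/(c) process as I understand it, in the spirit of iterated amplification. All the names here — Agent, amplify, iterate — are my own illustrative inventions, not anything from Paul's actual proposal, and the "composition" step is deliberately trivial.)

```python
from typing import Callable

Question = str
Answer = str


class Agent:
    """A weak agent: answers questions with a fixed policy and no
    open-ended internal optimization (step (a))."""

    def __init__(self, policy: Callable[[Question], Answer]):
        self.policy = policy

    def answer(self, question: Question) -> Answer:
        return self.policy(question)


def amplify(agent: Agent, team_size: int = 3) -> Agent:
    """Step (b): combine copies of a weak agent into a more capable 'team'.
    The composition here is deliberately trivial (split the question, then
    concatenate the sub-answers); the hope is that the composition itself
    introduces no problematic optimization."""

    def team_policy(question: Question) -> Answer:
        sub_answers = [
            agent.answer(f"sub-question {i + 1} of: {question}")
            for i in range(team_size)
        ]
        return " | ".join(sub_answers)

    return Agent(team_policy)


def iterate(base: Agent, rounds: int) -> Agent:
    """Step (c): repeat amplification, so capability grows round by round
    while (ideally) alignment is preserved at each step."""
    agent = base
    for _ in range(rounds):
        agent = amplify(agent)
    return agent


if __name__ == "__main__":
    humanlike = Agent(lambda q: f"[best effort on: {q}]")
    stronger = iterate(humanlike, rounds=2)
    print(stronger.answer("plan a safe research agenda"))
```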
Both these views make some sense to me.
One question that comes to mind is this: do regular bureaucracies exhibit unaligned behavior? It seems like the answer is broadly “yes, but only moderately unaligned.” It seems like actual companies are an example of how one can get superintelligent output from humanly intelligent parts, in a way that doesn’t seem well described as the parts in aggregate “effectively...running AGI code that they don’t understand.” And they don’t exhibit wildly unaligned behavior because the executives of the company have a pretty good idea of the whole picture.
(Of course, those executives don’t have much detail in their overview. They need to rely on middle managers to make sure that nothing really bad is happening in the individual departments and individual teams. But it seems like there’s not much damage that small teams of humans can do on their own, because their power is pretty limited. The same would be true in Paul’s proposal.)
It seems to me that Eliezer’s point is broadly correct in the sense that a collection of small agents can be organized in such a way that they are effectively emulating an unaligned superintelligence that they don’t understand. But not all aggregates of small agents have this property, particularly if they are arranged in a hierarchy where the top levels have a high-level view of the plan and its execution.