For others who want the resolution to this cliffhanger, what does Bostrom predict happens next?
The remainder of this section:
We observe here how it could be the case that when dumb, smarter is safer; yet when smart, smarter is more dangerous. There is a kind of pivot point, at which a strategy that has previously worked excellently suddenly starts to backfire. We may call the phenomenon the treacherous turn.
The treacherous turn — While weak, an AI behaves cooperatively (increasingly so, as it gets smarter). When the AI gets sufficiently strong — without warning or provocation — it strikes, forms a singleton, and begins directly to optimize the world according to the criteria implied by its final values.
A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later; but this model should not be interpreted too narrowly. For example, an AI might not play nice in order that it be allowed to survive and prosper. Instead, the AI might calculate that if it is terminated, the programmers who built it will develop a new and somewhat different AI architecture, but one that will be given a similar utility function. In this case, the original AI may be indifferent to its own demise, knowing that its goals will continue to be pursued in the future. It might even choose a strategy in which it malfunctions in some particularly interesting or reassuring way. Though this might cause the AI to be terminated, it might also encourage the engineers who perform the postmortem to believe that they have gleaned a valuable new insight into AI dynamics—leading them to place more trust in the next system they design, and thus increasing the chance that the now-defunct original AI’s goals will be achieved. Many other possible strategic considerations might also influence an advanced AI, and it would be hubristic to suppose that we could anticipate all of them, especially for an AI that has attained the strategizing superpower.
A treacherous turn could also come about if the AI discovers an unanticipated way of fulfilling its final goal as specified. Suppose, for example, that an AI’s final goal is to “make the project’s sponsor happy.” Initially, the only method available to the AI to achieve this outcome is by behaving in ways that please its sponsor in something like the intended manner. The AI gives helpful answers to questions; it exhibits a delightful personality; it makes money. The more capable the AI gets, the more satisfying its performances become, and everything goeth according to plan—until the AI becomes intelligent enough to figure out that it can realize its final goal more fully and reliably by implanting electrodes into the pleasure centers of its sponsor’s brain, something assured to delight the sponsor immensely. Of course, the sponsor might not have wanted to be pleased by being turned into a grinning idiot; but if this is the action that will maximally realize the AI’s final goal, the AI will take it. If the AI already has a decisive strategic advantage, then any attempt to stop it will fail. If the AI does not yet have a decisive strategic advantage, then the AI might temporarily conceal its canny new idea for how to instantiate its final goal until it has grown strong enough that the sponsor and everybody else will be unable to resist. In either case, we get a treacherous turn.
“A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later”
LLMs are clearly not playing nice as part of a strategic decision to build strength while weak in order to strike later! Yet, Bostrom imagines that general AIs would do this, and uses it as part of his argument for why we might be lulled into a false sense of security.
This means that current evidence is quite different from what’s portrayed in the story. I claim LLMs are (1) general AIs that (2) are doing what we actually want them to do, rather than pretending to be nice because they don’t yet have a decisive strategic advantage. These facts are crucial, and make a big difference.
I am very familiar with these older arguments. I remember repeating them to people after reading Bostrom’s book, years ago. What we are seeing with LLMs is clearly different from the picture presented in these arguments, in a way that critically affects the conclusion.
See my reply elsewhere in thread.