A misaligned AI might be incentivized to behave identically to a helpful human until it can safely pursue its true objective.
It could, but some humans might also do that. Indeed, humans do that kind of thing all the time.
AIs might behave similarly to humans in typical situations but diverge from human norms once they become superintelligent.
But they wouldn’t ‘become’ superintelligent, because there would be no further training once the AI had finished training. And out-of-distribution (OOD) inputs won’t produce different outputs if the underlying function is the same. Given a complexity prior and enough data, ML algorithms will converge on the same function the human brain uses.
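The point that OOD inputs can’t separate two systems computing the same function can be made concrete with a toy sketch (the functions and names below are illustrative, not drawn from the argument above): two structurally different implementations of one function necessarily agree on every input, so divergence on new inputs is only possible if the learned functions actually differ.

```python
# Minimal sketch with hypothetical functions: if two systems implement the same
# underlying function, they agree everywhere, including on inputs far outside
# the range they were originally checked on.

def square_closed_form(x: int) -> int:
    """One implementation: the closed-form expression."""
    return x * x

def square_by_summation(x: int) -> int:
    """A structurally different implementation of the same function:
    x^2 computed as the sum of the first |x| odd numbers."""
    return sum(2 * i + 1 for i in range(abs(x)))

# "In-distribution" inputs the two implementations were checked against.
in_distribution = list(range(0, 10))
# "Out-of-distribution" inputs, far outside that range.
out_of_distribution = [1_000, 54_321, 10**6]

for x in in_distribution + out_of_distribution:
    assert square_closed_form(x) == square_by_summation(x)

print("Identical outputs on every input, in- and out-of-distribution.")
```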
The AIs might be perfect human substitutes individually but produce unexpected emergent behavior that can’t easily be foreseen when acting as a group. To use an analogy, adding grains of sand to a pile one by one seems stable until the pile collapses in a mini-avalanche.
The behavior will follow the same probability distribution, since the distribution of outputs for a given AI is the same as for the human it is a functional copy of. Think of a thousand piles of sand from the same well-mixed batch: each pile is slightly different, but every one falls within the same distribution.
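A minimal sketch of the “same distribution” point (using stand-in Gaussian samplers, not a model of actual human or AI behavior): two samplers drawing from one distribution produce samples that differ individually but are statistically indistinguishable, much like the sand piles in the analogy.

```python
# Illustrative sketch only: two samplers drawing from the same distribution give
# samples that a two-sample Kolmogorov-Smirnov test cannot tell apart, even
# though no two individual draws are identical.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# "Human" behavior and its functional copy: independent draws from one distribution.
human_outputs = rng.normal(loc=0.0, scale=1.0, size=10_000)
copy_outputs = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Individual draws differ, but the distributions match.
stat, p_value = ks_2samp(human_outputs, copy_outputs)
print(f"KS statistic = {stat:.4f}, p-value = {p_value:.3f}")
# A large p-value means the test finds no evidence that the two samples
# come from different distributions.
```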