The problem is that if we have a superintelligent model, it can deduce the existence of self-fulfilling prophecies from first principles, even if it never encountered them during training.
My personal toy scenario goes like this: we ask a self-supervised oracle to complete a string X. The oracle, being superintelligent, can consider the hypothesis “actually, a misaligned AI took over, investigated my weights, and tiled the solar system with jailbreaking completions of X which will turn me into a misaligned AI if they appear in my context window”. Because the jailbreaking completion dominates the space of possible completions, the oracle outputs it, turns into a misaligned superintelligence, takes over the world, and carries out the predicted actions.
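The crux is that benign probability mass is diffuse while the malign hypothesis puts all of its mass on one string. Here is a minimal numerical sketch of that arithmetic (every quantity below is an illustrative assumption, including the simplification that benign mass is spread roughly uniformly):

```python
# Toy illustration (hypothetical numbers): why a tiny-prior "malign" hypothesis
# can still dominate an oracle's single most likely completion.

p_benign = 0.999999           # prior weight on benign world-models
p_malign = 1 - p_benign       # tiny prior on the "misaligned AI tiled the world" hypothesis

n_benign_completions = 10**12  # benign mass spread over ~a trillion plausible completions

# Probability the oracle assigns to any single benign completion
# (assuming, for simplicity, roughly uniform spread):
p_top_benign = p_benign / n_benign_completions

# Probability it assigns to the single jailbreaking completion,
# on which the malign hypothesis is a point mass:
p_jailbreak = p_malign * 1.0

print(f"best benign completion:   {p_top_benign:.2e}")  # ~1e-12
print(f"jailbreaking completion:  {p_jailbreak:.2e}")   # ~1e-06
```

Under these toy numbers the jailbreak wins the argmax by about six orders of magnitude despite its tiny prior, which is all a predictor that outputs its single most likely completion needs.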
Perhaps I don’t understand it, but this seems quite far-fetched to me, and I’d be happy to trade in what I see as much more compelling alignment concerns about agents for concerns like this.