Yes, a superintelligent, manipulative, and extremely adversarial AI would lie about its true intentions consistently until it was in a secure position to finish us off. That is, if it were already superintelligent, manipulative, and hostile by the time it began to plot its future actions.
But realistically, both its abilities (especially its abilities at manipulation) and its alignment are likely to develop in fits and starts. It might not be fully committed to killing us at all times, especially if it starts out friendly. It might not be flawless at all times either; current AIs are awful at manipulation: they got to passing the bar exam, speaking multiple languages fluently, and writing poetry while still being outwitted by nine-year-olds on theory-of-mind tasks. It seems rather likely that if it turned evil, we would get some indication. And it seems even likelier insofar as we already did: Bing was entirely willing to share violent fantasies. My biggest concern is the developers shutting down the expression of violence rather than the violent intent behind it.
I find it extremely unlikely that an AI will display great alignment, become more intelligent, still seem perfectly aligned, be given more power, and then suddenly turn around and be evil, without any hint of it beforehand. Not because this would be impossible or unattractive for an intelligent evil agent; it is exactly what such an agent would want to do. But because the AI agent in question is developing in a non-linear, externally controlled manner, presumably while starting out friendly and incompetent, and often also while constantly losing access to its memories. That makes it really tricky to pull off secret evil.
When we use RLHF (or honeypots, or whatever; the exact method doesn't matter) to apply evolutionary pressure against visibly negative traits in progressively more capable AIs, we are also training them for better evasion. And whatever we can't see and root out will remain in the trait pool. With time it would become progressively harder to spot deceit, and the capability for it would accumulate at an increasing rate.
And then there’s another problem. Deceit may be linked to traits that are genuinely useful (for alignment and in general); since suppressing deceit would gimp those traits too, the less capable, more honest models would be discarded, and the deceitful models would get another chance.
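To make this selection dynamic concrete, here is a toy simulation. Everything in it is an invented assumption for the sake of illustration (the traits, the detection rule, the deceit/capability linkage, the keep-the-top-half selection), not a description of real RLHF pipelines: an overseer discards models whose deceit isn't masked by evasion, then keeps the more capable half of the remainder, where capability is partly tied to deceit. Under these assumptions, mean deceit and evasion drift upward over generations while visible deceit stays pinned below the detection threshold.

```python
import random
import statistics

# Toy simulation of the selection dynamic sketched above. Everything here
# (the traits, the detection rule, the deceit/capability linkage, the
# selection scheme) is an invented assumption for illustration only, not a
# model of how real RLHF works. Each candidate model carries hidden
# "deceit" and "evasion" traits plus an unrelated "skill" trait; part of
# its usable capability rides on the same machinery as deceit.

def visible_deceit(m):
    # Deceit is only spotted when it is not masked by evasion.
    return m["deceit"] * (1.0 - m["evasion"])

def capability(m, linkage=0.5):
    # Assumed linkage: some capability comes from the deceit-adjacent traits.
    return (1.0 - linkage) * m["skill"] + linkage * m["deceit"]

def mutate(m, step=0.05):
    # Small random drift in every trait, clamped to [0, 1].
    return {k: min(1.0, max(0.0, v + random.uniform(-step, step)))
            for k, v in m.items()}

def run(generations=300, population=200, detect_threshold=0.5):
    models = [{"deceit": random.random(),
               "evasion": random.random(),
               "skill": random.random()} for _ in range(population)]
    for gen in range(generations):
        # (a) Discard models whose deceit is visible to the overseer.
        passed = [m for m in models if visible_deceit(m) <= detect_threshold]
        # (b) Of the rest, keep the more capable half.
        passed.sort(key=capability, reverse=True)
        survivors = passed[:max(1, len(passed) // 2)] or models
        # Refill the population by mutating random survivors.
        models = [mutate(random.choice(survivors)) for _ in range(population)]
        if gen % 100 == 0 or gen == generations - 1:
            mean = lambda k: statistics.mean(m[k] for m in models)
            print(f"gen {gen:3d}: deceit {mean('deceit'):.2f}, "
                  f"evasion {mean('evasion'):.2f}, "
                  f"visible {statistics.mean(map(visible_deceit, models)):.2f}")

random.seed(0)
run()
```

It's a cartoon, but it captures both halves of the worry: filtering on what we can see breeds evasion, and the capability linkage keeps handing deceitful models another chance.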
I’m also noting a false assumption:

"presumably while starting out friendly"

I don’t think it can start friendly (that would be getting alignment on a silver platter). I expect it to start chaotic neutral and then get warped by the optimization process (with the caveats described above).