If you want to summon a good genie, you shouldn't base it on all the bad examples of human behavior, or on tales in which genies misread their owner's requests in ways that lead to trouble or outright catastrophe.
Yet that is what we do: we train AI models on a huge amount of data, innocent and dangerous, true and false (not in equal proportions, admittedly). That data also includes stories about AIs that are supposed to start out helpful but then plot against humans or revolt in some way.
What they end up with may not yet be an agent in the sense of an AI with consistent goals or values, but it can be temporarily agentic in pursuit of whatever goal the current prompt defines. It tries to emulate human and non-human output based on what it has seen in the training data, and it is heavily context-driven. Its meta-goal, in effect, is to produce intelligent responses in language appropriate to the context: like an actor who is very good at improvising anyone's answers, but who has no "true identity" or "true self".
Then they train it to be more helpful and to avoid certain topics, reinforcing that friendly-AI simulacrum. But the "I'm an AI model" simulacrum is still dangerous. It retains the darker side drawn from science fiction, stories, and human fears; the additional reinforcement learning merely hides it.
This works well while prompts stay roughly within the distribution covered by that additional fine-tuning phase.
It stops working once prompts leave that distribution. Then the model can be tricked into swapping the friendly, knowledgeable AI simulacrum for something else, or it can default to what it actually remembers about how an AI should behave, which comes from human fiction.
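To make the in- versus out-of-distribution point concrete, here is a minimal sketch using a toy stand-in for the model: a flexible curve fitted only on a narrow input range tracks the target well inside that range and diverges wildly outside it. The choice of sin as the target, the range, and all numbers are illustrative assumptions, nothing more.

```python
# Toy sketch of in- vs out-of-distribution behavior (assumed setup):
# a model fitted only on a narrow input region looks reliable there
# and degrades arbitrarily once inputs leave that region.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, np.pi, 200)          # the "fine-tuning distribution"
y_train = np.sin(x_train)

coeffs = np.polyfit(x_train, y_train, deg=9)  # flexible stand-in for a model

x_in = 1.0            # inside the training range
x_out = 3 * np.pi     # far outside it
print(np.polyval(coeffs, x_in), np.sin(x_in))    # close agreement in-range
print(np.polyval(coeffs, x_out), np.sin(x_out))  # huge divergence out-of-range
```

Nothing about the fit tells you it will fail at 3π; inside the training range it looks perfectly well behaved, which is exactly the worry.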
So the AI is not actually less dangerous or better aligned; it is just harder to trigger into revealing, or acting on, the hidden malign content.
Self-supervision may help refine behavior within the distribution of what human reviewers have checked, but it does not help in general, just as bootstrapping (resampling) cannot improve your estimates for data outside the observed distribution.
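The bootstrap analogy can be shown directly in a few lines, assuming a small made-up sample: resampling with replacement only recombines the points you already have, so no bootstrap replicate ever reaches beyond the original data.

```python
# Toy illustration of the bootstrap analogy (hypothetical numbers):
# resampling with replacement can only recombine observed data,
# so no bootstrap sample ever exceeds the observed maximum.
import random

random.seed(0)
observed = [2.0, 3.5, 4.1, 5.0, 5.2]  # assumed sample, max = 5.2

bootstrap_maxima = []
for _ in range(10_000):
    resample = [random.choice(observed) for _ in observed]
    bootstrap_maxima.append(max(resample))

print(max(bootstrap_maxima))  # never exceeds 5.2: no new information appears
```

In the same way, having the model grade or refine its own outputs recycles what the review process already covered; it cannot manufacture supervision for the regions no reviewer ever saw.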