“out of distribution” detectors. I am not precisely certain how to implement one of these. I just notice that when we ask a language or art model to generate something from a prompt, or ask it to describe what it means by an “idea”, what it shows us is what it considers “in distribution” for that idea.
Implicitly, this means a system could generate a set of predicted outcomes for how it believes the real world will respond to the machine’s own actions. When the real-world outcomes start to diverge wildly from its predictions, that divergence should cross a threshold at which the AI shuts down.
At that point safety systems would kick in, either dumber AIs or conventional control systems, to bring whatever the AI was controlling to a stop, or to hand control off to a human.
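To make the idea a bit more concrete, here is a minimal sketch of that divergence-threshold loop in Python. Everything here is hypothetical: the `model`, `world`, and `fallback_controller` objects and their methods are placeholder names for whatever the real system would provide, and the "surprise" score is a crude standard-deviation measure standing in for a proper OOD score.

```python
import numpy as np

# Hypothetical threshold: shut down when reality lands roughly 3+ standard
# deviations outside the model's own predicted distribution of outcomes.
DIVERGENCE_THRESHOLD = 3.0


def surprise(predicted_samples: np.ndarray, observed: np.ndarray) -> float:
    """How far the observed outcome falls outside the model's predicted
    distribution, in standard deviations (a crude stand-in for a real
    out-of-distribution score)."""
    mean = predicted_samples.mean(axis=0)
    std = predicted_samples.std(axis=0) + 1e-8  # avoid division by zero
    return float(np.max(np.abs(observed - mean) / std))


def control_loop(model, world, fallback_controller):
    """Sketch of the loop described above: predict, act, compare, and
    hand off to a simpler trusted controller if predictions diverge."""
    while True:
        state = world.state()
        action = model.choose_action(state)
        # The model samples what it expects the world to do in response.
        predicted = model.predict_outcomes(state, action, n_samples=100)
        observed = world.step(action)
        if surprise(predicted, observed) > DIVERGENCE_THRESHOLD:
            # Reality has drifted out of the model's own distribution:
            # stop and hand control to a dumber but trusted system.
            model.shut_down()
            fallback_controller.take_over(world)
            break
```

The design choice worth noting is that the comparison is against the model’s *own* predictions, so no separate ground-truth model is needed; the fallback only has to be trustworthy enough to bring things to a safe stop.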
DeepMind has some work on out of distribution detection, for example: https://www.deepmind.com/publications/contrastive-training-for-improved-out-of-distribution-detection I haven’t looked very closely at it yet though.