I think some other approaches could also be in the direction listed in this post:
1) Active boxing, or catching a treacherous turn: one AI observes the behaviour of another AI and predicts when it will start to fail.
2) AI tripling: three very similar AIs work independently on the same problem, and if one of them diverges sufficiently from the other two, it is turned off (a rough sketch of such a divergence check is below).
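A minimal sketch of what such a divergence check could look like, assuming the three systems expose comparable numeric outputs and that "sufficiently divergent" means exceeding some distance threshold (the metric, the threshold, and the function name are illustrative assumptions, not anything specified above):

```python
import numpy as np

def divergence_check(outputs, threshold=0.1):
    """Given outputs from three near-identical AIs on the same problem,
    return the index of the one to shut off if it diverges from the
    other two by more than `threshold`, else None."""
    assert len(outputs) == 3
    # Pairwise L2 distances between the three outputs (illustrative metric).
    d = {(i, j): np.linalg.norm(outputs[i] - outputs[j])
         for i in range(3) for j in range(i + 1, 3)}
    # An outlier is far from *both* of its peers, so score each AI by the
    # smaller of its two distances to the others.
    divergence = [
        min(d[tuple(sorted((i, j)))] for j in range(3) if j != i)
        for i in range(3)
    ]
    worst = int(np.argmax(divergence))
    return worst if divergence[worst] > threshold else None

# Example: the third system drifts away from its two peers.
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.1, 3.0])
c = np.array([4.0, 0.0, 9.0])
print(divergence_check([a, b, c]))  # -> 2
```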
I agree that you probably need ensembling in addition to these techniques.
At best this technique would produce a system which has a small probability of unacceptable behavior for any input. You’d then need to combine multiple of those to get a system with negligible probability of unacceptable behavior.
I expect you often get this for free, since catastrophe either involves a bunch of different AI systems behaving unacceptably, or a single AI behaving consistently unacceptably across time.
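To make the combination step concrete: under the (strong, and here purely illustrative) assumption that each of k systems behaves unacceptably on a given input independently with probability p, and that a catastrophe requires all of them to misbehave at once, the probability of catastrophe falls to roughly p^k:

```python
# Purely illustrative arithmetic; the independence assumption is the crux.
p = 0.01             # assumed per-system probability of unacceptable behavior
for k in (1, 2, 3):  # number of independently trained systems combined
    print(f"k={k}: catastrophe probability ~ {p ** k:g}")
# k=1: 0.01, k=2: 0.0001, k=3: 1e-06 -- negligible as k grows
```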