I agree that you probably need ensembling in addition to these techniques.
At best this technique would produce a system which has a small probability of unacceptable behavior on any given input. You’d then need to combine several such systems to get one with negligible probability of unacceptable behavior.
I expect you often get this for free, since catastrophe either involves a bunch of different AI systems behaving unacceptably, or a single AI behaving consistently unacceptably across time.
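The quantitative intuition behind combining systems can be sketched as follows. This is my own illustrative sketch, not something from the comment: it assumes each ensemble member misbehaves independently with probability at most p, and that a catastrophe requires every member to fail (e.g. any one member can veto an action). Correlated failures would weaken the bound substantially.

```python
def ensemble_failure_bound(p: float, n: int) -> float:
    """Upper bound on the probability that all n members misbehave on one
    input, assuming each fails independently with probability at most p.
    (Illustrative assumption, not a claim from the original comment.)"""
    return p ** n

# Three systems, each misbehaving on 1% of inputs, gives roughly a
# one-in-a-million failure probability under the independence assumption.
print(ensemble_failure_bound(0.01, 3))
```

The same multiplication is why catastrophe "for free" arguments work across time: if unacceptable behavior must persist over many independent checks, the per-check failure probabilities compound in the same way.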