This, more than the original paper, or the recent Anthropic paper, is the most convincingly-worrying example of AI scheming/deception I’ve seen. This will be my new go-to example in most discussions.
This result comes from first identifying a model property that is worrying both deeply and on its face, then robustly eliciting it, and finally ruling out alternative hypotheses.