This, more than the original paper, or the recent Anthropic paper, is the most convincingly-worrying example of AI scheming/deception I’ve seen. This will be my new go-to example in most discussions.
This result comes from first identifying a model property that is worrying both deeply and on its face, then robustly eliciting it, and finally ruling out alternative hypotheses.