I’m curious why in example 1 the ability to manipulate (“persuade”) is called a capability evaluation, making limited results eligible for the sandbagging label, whereas in example 6 (under the name “blackmail”) it is called an alignment evaluation, making limited results ineligible for that label?
Both examples tuned out the manipulation enough to hide it in testing, with worse results in production. Can someone help me better understand the nuances we’d like to impose on sandbagging? Benchmark evasion is an area I only started getting into in November.
Maybe reading this post will help! The beginning in particular discusses the difference between capability and alignment/propensity evaluations.