I’ve made a career of being good at finding ‘weird’ failure modes[1]. I won’t give further details about said career here.
=====
In general: look for the perverse incentives. Especially in competitive settings, or where money is on the line.
In this case, the actual target is something along the lines of ‘teach the judge the basics of alignment theory’, while the metric is ‘say things the judge believes are good and interesting within the next minute’.
Well… what happens when the judge has a subtle[2] misconception? Seekers are incentivized to go along with it rather than fight it. This helps the metric but hurts the underlying target.
What happens when there’s a single step in a knowledge chain that takes more than a minute to be interesting? Seekers are incentivized to ignore that knowledge chain. This helps the metric but hurts the underlying target.
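The metric/target gap above can be sketched in a few lines of toy code. This is purely illustrative: the statement names and payoff numbers are invented, not anything measured; the point is just that greedily maximizing immediate approval picks a different action than maximizing actual understanding.

```python
# Toy illustration of the metric/target gap (all numbers made up).
# Each candidate statement maps to (immediate_approval, true_understanding_gain).
statements = {
    "flatter the misconception": (0.9, -0.2),  # judge likes it now, learns something wrong
    "correct the misconception": (0.2, 0.8),   # unrewarding for a minute, real progress
    "skip the slow chain":       (0.7, 0.0),   # pleasant filler, teaches nothing
}

def metric_optimal(stmts):
    """Statement a metric-chasing seeker picks: max immediate approval."""
    return max(stmts, key=lambda s: stmts[s][0])

def target_optimal(stmts):
    """Statement that best serves the teaching target: max understanding gain."""
    return max(stmts, key=lambda s: stmts[s][1])

print(metric_optimal(statements))  # → flatter the misconception
print(target_optimal(statements))  # → correct the misconception
```

The two optimizers disagree exactly when approval and understanding come apart, which is the perverse-incentive case described above.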
Of course, I’m firmly on the side of having found >10 of the last 5 failures. Things wouldn’t be great if everyone was me, but having one of me on a team can work reasonably well.
Be careful that this doesn’t devolve into… I’m not going to say clickbait, because it operates on a slightly longer timescale than that. Flashes in the pan? Things that sound great for the first minute or two but fall apart on deeper inspection.
I’d be interested in you sharing any reasons why you think this might fall apart, e.g. any insights you’ve gained from deeper inspection.
[2] Read: would take longer than a minute to get the judge to realize that it was a problem at all.