I’ve made a career of being good at finding ‘weird’ failure modes[1]. I won’t give further details about said career here.
=====
In general: look for the perverse incentives. Especially in competitive settings, or where money is on the line.
In this case, the actual target is something along the lines of ‘teach the judge the basics of alignment theory’, and the metric is ‘say things that the judge believes is good and interesting within the next minute’.
Well… what happens when the judge has a subtle[2] misconception? Seekers are incentivized to go along with it rather than fight it. This helps the metric but hurts the underlying target.
What happens when there’s a single step in a knowledge chain that takes more than a minute to be interesting? Seekers are incentivized to ignore that knowledge chain. This helps the metric but hurts the underlying target.
Of course, I’m firmly on the side of having found >10 of the last 5 failures. Things wouldn’t be great if everyone was me, but having one of me on a team can work reasonably well.
I’ve made a career of being good at finding ‘weird’ failure modes[1]. I won’t give further details about said career here.
=====
In general: look for the perverse incentives. Especially in competitive settings, or where money is on the line.
In this case, the actual target is something along the lines of ‘teach the judge the basics of alignment theory’, and the metric is ‘say things that the judge believes is good and interesting within the next minute’.
Well… what happens when the judge has a subtle[2] misconception? Seekers are incentivized to go along with it rather than fight it. This helps the metric but hurts the underlying target.
What happens when there’s a single step in a knowledge chain that takes more than a minute to be interesting? Seekers are incentivized to ignore that knowledge chain. This helps the metric but hurts the underlying target.
Etc.
Of course, I’m firmly on the side of having found >10 of the last 5 failures. Things wouldn’t be great if everyone was me, but having one of me on a team can work reasonably well.
Read: would take longer than a minute to get the judge to realize that it was a problem at all.