Yeah I think this is Evan’s view. This is from his research agenda (I’m guessing you might have already seen this given your comment but I’ll add it here for reference anyway in case others are interested)
> I suspect we can in fact design transparency metrics that are robust to Goodharting when the only optimization pressure being applied to them is coming from SGD, but cease to be robust if the model itself starts actively trying to trick them.
And I think his view on deception through inner optimisation pressure is that this is something we’ll basically be powerless to deal with once it happens, so the only way to make sure it doesn’t happen is to chart a safe path through model space that never enters the deceptive region in the first place.