Whether this works or not is going to depend heavily on what W looks like.
Given E(W|π0) − E(W|π∗V) > C ≤ 0, i.e. W ∈ W+, what does this say about E(W|π)?
The answer depends on the amount of mutual information between W|π0, W|π∗V and W|π. Unfortunately, the more generic W is (i.e. any function is possible), the less mutual information there will be. Therefore, unless we know some structure about W, the restriction to W+ is not going to do much. The agent will just find a very different policy π∗V′ that also achieves very high V in some very Goodharty way, but does not get penalized, because a low W value on π∗V is not correlated with a low W value on π∗V′.
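To make the mutual information point concrete, here is a rough toy simulation. Everything in it is my own assumption for illustration, not part of the original proposal: each policy is deterministic so E(W|π) is just a number, W values are Gaussian, and a hypothetical parameter `rho` controls how much the W values on the two Goodharty policies share a common cause (`rho = 0` stands in for a fully generic W). It estimates the average W value on the second policy, given that W ∈ W+.

```python
import random


def conditional_mean_w(rho, c=0.0, n_samples=20000, seed=0):
    """Estimate E[W(pi_V') | W in W+] in a toy Gaussian model.

    Assumed toy model (not from the original post):
      - W(pi_0) is independent standard normal noise.
      - W(pi_V) and W(pi_V') each mix a shared latent factor (weight rho)
        with independent noise, so their correlation is rho ** 2.
      - W is in W+ when W(pi_0) - W(pi_V) > c.
    """
    rng = random.Random(seed)
    noise_scale = (1.0 - rho ** 2) ** 0.5
    kept = []
    for _ in range(n_samples):
        shared = rng.gauss(0.0, 1.0)  # structure common to both Goodharty policies
        w_pi0 = rng.gauss(0.0, 1.0)
        w_piv = rho * shared + noise_scale * rng.gauss(0.0, 1.0)
        w_piv_alt = rho * shared + noise_scale * rng.gauss(0.0, 1.0)
        if w_pi0 - w_piv > c:  # keep only the W that penalize pi_V
            kept.append(w_piv_alt)
    return sum(kept) / len(kept)


generic = conditional_mean_w(rho=0.0)
structured = conditional_mean_w(rho=0.9)
print(f"generic W:    E[W(pi_V') | W in W+] ~ {generic:+.3f}")
print(f"structured W: E[W(pi_V') | W in W+] ~ {structured:+.3f}")
```

In this toy model the generic case should come out close to zero, the same as without any conditioning, while the structured case should be clearly negative. So knowing that W penalizes π∗V only says something about π∗V′ when W has structure linking the two policies.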
This could possibly be fixed by adding assumptions of the type E(U|π0) > E(W|π) for any π that does too well on V. That might yield something interesting, or it might just be a very complicated way of specifying a satisficer; I don’t know.
You can’t get too much work from a single bit of information ^_^