I agree this proposal wouldn’t be robust enough to optimize against as stated, but this doesn’t bother me much, for a couple of reasons:
First, this seems like a very natural sub-problem that captures a large fraction of the difficulty of the full problem while being more tractable. Even just from a general research perspective that seems quite appealing; at a minimum, I think solving this would teach us a lot.
Second, it seems like even without optimization this could give us access to something like aligned superintelligent oracle models. I think this would represent significant progress and would be a very useful tool for building more robust solutions.
I have some more detailed thoughts about how we could extend this to a full/robust solution (though I’ve also deliberately thought much less about that than about how to solve this sub-problem), but I don’t think that’s really the point: to me, this already seems like a pretty robustly good problem to work on.
(But I do think this is an important point that I forgot to mention, so thanks for bringing it up!)