I am happy to take a “non-worst-case” empirical perspective in studying this problem. In particular, I suspect it will be very helpful – and possibly necessary – to use incidental empirical properties of deep learning systems, which often have a surprising amount of useful emergent structure (as I will discuss more under “Intuitions”).
One reason I feel sad about depending on incidental properties is that it likely implies the solution isn’t robust enough to optimize against. This is a key desideratum for an ELK solution. I imagine this optimization would typically come from two sources:
Directly trying to train against the ELK outputs (IMO quite important)
The AI gaming the ELK solution by manipulating its thoughts (IMO not very important, but it depends on exactly how non-robust the solution is)
That’s not to say non-robust ELK approaches aren’t useful: as long as you never apply too many bits of optimization against the approach, it should remain a useful test for other techniques.
It also seems plausible that work like this could eventually lead to a robust solution.
I agree this proposal wouldn’t be robust enough to optimize against as stated, but this doesn’t bother me much for a couple of reasons:
This seems like a very natural sub-problem that captures a large fraction of the difficulty of the full problem while being more tractable. Even just from a general research perspective that seems quite appealing—at a minimum, I think solving this would teach us a lot.
It seems like even without optimization this could give us access to something like aligned superintelligent oracle models. I think this would represent significant progress and would be a very useful tool for more robust solutions.
I have some more detailed thoughts about how we could extend this to a full/robust solution (though I’ve also deliberately thought much less about that than how to solve this sub-problem), but I don’t think that’s really the point—this already seems like a pretty robustly good problem to work on to me.
(But I do think this is an important point that I forgot to mention, so thanks for bringing it up!)