Wait, so PreDCA solves inner-misalignment by just … assuming that “we will later have an ideal learning theory with provable guarantees”?
By the claim “PreDCA solves inner-misalignment” as implied by the original protocol / distillation posts, I thought it somehow overcame the core problem of demons-from-imperfect-search. But it seems like the protocol already starts with an assumption of “demons-from-imperfect-search won’t be a problem because of amazing theory” and instead tackles a special instantiation of inner-misalignment that happens because of the nature of the protocol itself (i.e. simulation hypotheses due to simplicity bias + assuming an ideal/perfect search or learning theory).
If my understanding is correct, I think the implication regarding inner-misalignment is misleading, because PreDCA is operating at a whole different level of abstraction/problem-level than most of the discourse around inner-misalignment.
I share this intuition that the solution as stated is underwhelming. But from my perspective that’s just because the key central piece is missing, and this wasn’t adequately communicated in the available public resources about PreDCA (even if it was stressed that it’s a work in progress). I guess this situation doesn’t look as worrisome to Vanessa simply because she has a clearer picture of that central piece, or good reasons to believe it will be achievable, which she hasn’t yet made public. Of course, as long as that remains the case, we should treat optimism with suspicion.
Also, let me note that my a priori understanding of the situation is not:
let’s suppose amazing theory will solve imperfect search, and then tackle the other inner misalignment directly stemming from our protocol
but more like:
given our protocol, we have good mathematical reasons to believe it will be very hard for an inner optimizer to arise without manipulating the hypothesis update. We will use amazing theory to find a concrete learning setup and prove/conjecture that said manipulation is not possible (or that its probability is low). We then hope the remaining inner optimization problems are rare/few/weak enough for other, more straightforward methods to render them highly unlikely (like having the core computing unit explicitly reason about the risk of inner optimization).