For my part, I expect a pile of kludges (learned via online model-based RL) to eventually guide the AI into doing self-reflection. (Self-reflection is, after all, instrumentally convergent.) If I’m right, then it would be pretty hard to reason about what will happen during self-reflection in any detail. Likewise, it would be pretty hard to intervene in how the self-reflection will work.
E.g., we can’t just “put in” or “not put in” a simplicity prior. The closest we could do is try to guess whether a “simplicity kludge” would have emerged, and to what extent that kludge would be active in the particular context of self-reflection, etc., which seems awfully fraught.
To be clear, while I think it would be pretty hard to intervene on the self-reflection process, I don’t think it’s impossible. I don’t have any great ideas right now, but it’s one of the things I’m working on.