> So the RL agent’s algorithm won’t make it e.g. explore wireheading either, and so the convergence theorems don’t apply even a little—even in spirit…

This seems like a great takeaway, and the part I agree with most here, though I’d state it a little less strongly. Did you see Richard Ngo’s Shaping Safer Goals (2020) or my Motivations, Natural Selection, and Curriculum Engineering (2021) responding to it[1]? Both relate to this sort of picture.
> I started off analyzing model-free actor-based approaches, but have also considered a few model-based setups
For various reasons I expect model-based RL to be a more viable path to AGI, mainly because I think creative exploration is a missing ingredient for addressing reward sparsity and the computational-complexity barrier to tree-ish planning. Maybe a sufficiently carefully constructed curriculum can get over these, but that’s likely to be a really substantial additional hurdle, perhaps dominating the engineering effort, and perhaps simply intractable.
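To make the sparsity-plus-complexity point a bit more concrete, here’s a minimal toy sketch (the numbers and the `naive_planning_cost` helper are mine, not from either post): with branching factor b and the nearest reward d steps away, a naive exhaustive planner has to expand on the order of b^d nodes before it ever sees a learning signal.

```python
# Toy illustration: cost of naive exhaustive tree search when the nearest
# reward is `depth` steps away and `branching` actions are available at
# every step. Nothing model-specific here; it just counts node expansions.

def naive_planning_cost(branching: int, depth: int) -> int:
    """Node expansions for exhaustive search out to `depth`."""
    return sum(branching ** d for d in range(1, depth + 1))

for depth in (5, 10, 20, 40):
    print(f"branching=10, first reward at depth {depth}: "
          f"{naive_planning_cost(10, depth):.2e} expansions")

# At depth 40 this is ~1e40 expansions: sparse reward plus brute-force tree
# search is hopeless without either a curriculum that keeps reward nearby or
# some smarter (creative?) way of directing exploration.
```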
I also expect model-based RL plus creative exploration[2] to be much more readily able to make exploratory leaps, perhaps including wireheading-like activities. Compare humans, who aren’t all that creative but still find ever more inventive ways to wirehead; as a society, quite a lot of selection and intelligent design has gone into setting up incentive structures that push people away from wireheading-like activities. Also, because human hardware is pretty messy and difficult to wirehead, such activities typically harm or destroy capability, which selects against them. But in general I don’t expect wireheading to necessarily harm capability.
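To illustrate that last claim, here’s a deliberately contrived toy environment of my own (the `TamperableEnv` class and its `tamper` action are invented for this sketch, not taken from anywhere): tampering pins the reported reward at its maximum while leaving the transition dynamics, and hence the agent’s ability to act, untouched.

```python
# Hypothetical toy environment in which wireheading does NOT damage capability.
# `tamper` rewires only the reward channel; `forward` still works exactly as
# before, so task competence and reward tampering are fully decoupled.

from dataclasses import dataclass

@dataclass
class TamperableEnv:
    position: int = 0            # task progress
    goal: int = 5
    sensor_tampered: bool = False

    def step(self, action: str) -> tuple[int, float]:
        if action == "tamper":
            self.sensor_tampered = True      # affects the reward signal only
        elif action == "forward":
            self.position += 1               # capability is untouched
        reward = 1.0 if self.sensor_tampered else float(self.position >= self.goal)
        return self.position, reward

env = TamperableEnv()
print(env.step("tamper"))    # (0, 1.0)  max reward, zero task progress
print(env.step("forward"))   # (1, 1.0)  and the agent can still act competently
```

The point of the toy is only that nothing forces wireheading and incompetence to go together; in humans that coupling comes from messy hardware, and it looks contingent rather than fundamental.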
> So we definitely can’t rule out agents which strongly (and not just weakly) value antecedent-computation-reinforcement. But it’s also not the overdetermined default outcome. More on that in future essays.
Looking forward to it!
P.S. I’m surprised you think that RL researchers on the whole actually believe that RL produces reward-maximisers, but your (few) pieces of evidence do indeed seem to suggest that! I suppose the apparent ‘surprisingness’ of the concept of inner misalignment should also point the same way. I’d still err toward assuming a mixture of sloppy language and actual mistakenness.
[1] Warning: both are quite verbose in my opinion, and I expect both would have been shorter if more time had been taken!
[2] By the way, ‘creative exploration’ is mostly magic to me, but I have reason to think it relates to temporal abstraction and recomposition in planning.
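To gesture at what that could buy (with toy numbers of my own for the option library, nothing empirical): if a planner can recompose a small set of macro-actions, each bundling k primitive steps, the effective horizon shrinks from d to roughly d/k, and the b^d blow-up from the earlier sketch becomes manageable.

```python
# Toy comparison: planning cost with primitive actions versus with temporally
# abstract macro-actions ("options") that each bundle several primitive steps.

def search_cost(num_choices: int, horizon: int) -> int:
    return num_choices ** horizon

primitive_actions, horizon = 10, 40
num_options, option_length = 20, 10   # assumed option library

flat_cost = search_cost(primitive_actions, horizon)
abstract_cost = search_cost(num_options, horizon // option_length)

print(f"flat search:  ~{flat_cost:.1e} expansions")
print(f"with options: ~{abstract_cost:.1e} expansions")
# ~1e40 vs ~1.6e5: abstraction doesn't create new behaviour by itself, but it
# makes recomposition of known chunks cheap enough to search over, which is
# roughly what I'm gesturing at with 'creative exploration'.
```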