I guess I’m not clear what the theta is for (maybe I missed something, in which case I apologize). Is there just one initial action, namely how close it goes? And is it trained to maximize an evaluation function of its proximity, with theta as the only parameter?
That assumption is doing a lot of work; it’s not clear what is packed into it, and it may not be sufficient to prove the argument.
Well, my reasoning isn’t publicly available yet, but this is in fact sufficient, and the assumption can be formalized. Fix any MDP and a discount rate γ; for each reward function there exists an optimal policy π∗ for that discount rate. I’m claiming that for γ sufficiently close to 1, optimal policies likely end up gaining power as an instrumentally convergent subgoal within that MDP.
(All of this can be formally defined in the right way. If you want the proof, you’ll need to hold tight for a while.)
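To make the claim concrete, here is a minimal sketch of the kind of experiment it suggests. The six-state MDP, the uniform sampling of rewards, and the reading of the "hub" state as the power-gaining option are all my own illustrative assumptions, not anything from the forthcoming proof: for many randomly sampled reward functions, compute the optimal policy by value iteration at γ close to 1 and see how often it takes the action that preserves more options.

```python
import numpy as np

# A toy deterministic MDP (illustrative assumption, not from the original argument):
# from the start state, one action goes to a "hub" with three onward choices,
# the other goes to a dead end that only self-loops.
# successors[s] lists the states reachable in one step from s; choosing an
# index into that list is the "action".
successors = {
    0: [1, 2],        # start: go to hub (1) or dead end (2)
    1: [3, 4, 5],     # hub: three absorbing states to choose from
    2: [2],           # dead end: self-loop only
    3: [3], 4: [4], 5: [5],
}

def optimal_first_action(reward, gamma=0.99, iters=1000):
    """Value iteration; returns the successor the optimal policy picks from state 0."""
    v = np.zeros(len(successors))
    for _ in range(iters):
        v = np.array([reward[s] + gamma * max(v[s2] for s2 in successors[s])
                      for s in successors])
    return max(successors[0], key=lambda s2: v[s2])

rng = np.random.default_rng(0)
n = 500
hub_count = sum(optimal_first_action(rng.uniform(size=6)) == 1 for _ in range(n))
print(f"fraction of random rewards whose optimal policy goes to the hub: {hub_count / n:.2f}")
```

With i.i.d. uniform rewards, the hub is optimal roughly whenever the best of its three reachable states beats the dead-end state, so most sampled reward functions send the optimal policy through the higher-optionality state. That is the flavor of "optimal policies tend to gain power" the claim is about, here only in a hand-picked toy environment.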