Yes, if we choose the utility function to make it a CDT agent optimizing for the reward of a single step (so a particular case of an act-based agent), then it won't care about future versions of itself, nor will it want to escape.
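To make the one-step case concrete, here is a minimal sketch (the environment model and function names are my own illustrative assumptions, not anything from the discussion above): a myopic agent of this kind just picks the action with the highest expected immediate reward, so nothing in its objective references later timesteps or later copies of itself.

```python
def myopic_act(actions, expected_immediate_reward):
    """Pick the action maximizing the expected reward for this step only.

    `expected_immediate_reward(a)` stands in for whatever model the agent
    uses of the causal effect of its action on the current-step reward.
    Nothing here evaluates consequences beyond the current step, so
    protecting future versions of itself or escaping would only be chosen
    if it happened to maximize the immediate reward.
    """
    return max(actions, key=expected_immediate_reward)
```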
I agree with the intuition that shutting down the system makes it episodic, but I am still confused about the causal relationship between “having the rule to shut down the system” and “having a current-timestep maximizer”. For the agent to really be a “current-timestep maximizer”, that objective needs to be encoded in some kind of reward/utility function. And because everything is reset at each timestep, there is no information pointing at “I might get shut down at the next timestep”.
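As a toy illustration of that point (again just a sketch with assumed names): a shutdown rule can only influence the optimum through an objective that sums over future steps; a purely per-step objective, with a reset after each step, has no term through which “I might get shut down next timestep” could matter.

```python
def per_step_objective(r_t):
    # With a reset after every timestep, this is all the agent optimizes;
    # a shutdown at t+1 does not change the value of any action at t.
    return r_t

def multi_step_objective(rewards, gamma=0.99):
    # Only here does shutdown matter: being shut down at step k truncates
    # this sum, so avoiding shutdown can raise the objective.
    return sum(gamma**i * r for i, r in enumerate(rewards))
```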
As for collecting a dataset and then optimizing for some natural direct effect, I am not familiar enough with Pearl’s work to tell whether that would work, but I made some related comments about why there might be problems in online learning / “training then testing” here.