So, how do you actually satisfy assumption #3? This seems surprisingly tricky. For example, if nothing matters to the agent in the case where it's stopped, it may take actions that assume it won't be stopped, because those are the only ones that contribute any expected utility.
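To spell that out (my framing, not the original post's): if every outcome in which the agent is stopped has utility zero, then for each action $a$,

$$\mathbb{E}[U \mid a] \;=\; P(\text{not stopped} \mid a)\,\mathbb{E}[U \mid a, \text{not stopped}] \;+\; P(\text{stopped} \mid a) \cdot 0,$$

so the stopped branch contributes nothing: all of the agent's expected utility comes from futures in which it isn't stopped, and it may even prefer actions that raise $P(\text{not stopped} \mid a)$.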
Hm, I think the obvious way is to assume the agent has a transparent ontology, so that we can specify at the start that it only cares about the world for k timesteps. This even gives it an incentive to retain this safeguard when self-modifying—if it planned for the far future it would probably do worse in the near future. But it also highlights an issue—the AI isn’t the same every time, and may still surprise us. Even if you run it from the same seed, you have to let the AI know its time horizon so it can make decisions that depend on it, and this may cause discontinuous jumps. For example, if an AI runs a search one more timestep than its predecessor, and the search succeeds on that step, maybe now it’s profitable to hack the humans when it wasn’t even aware of the possibility before.
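As a toy sketch of what "only cares about the world for k timesteps" could look like, assuming some learned world model with a step(state, action) -> (next_state, reward) interface (the Model name and interface here are hypothetical, purely for illustration):

```python
from itertools import product
from typing import Protocol, Sequence, Tuple


class Model(Protocol):
    """Hypothetical world-model interface: predicts the next state and reward."""

    def step(self, state, action) -> Tuple[object, float]:
        ...


def horizon_k_utility(model: Model, state, plan: Sequence[int], k: int) -> float:
    """Sum predicted reward over at most the first k steps of `plan`.

    Everything after step k is ignored, so outcomes beyond the horizon
    contribute nothing to the plan's score.
    """
    total = 0.0
    for action in plan[:k]:
        state, reward = model.step(state, action)
        total += reward
    return total


def best_plan(model: Model, state, actions: Sequence[int], k: int) -> Tuple[int, ...]:
    """Brute-force search over all length-k action sequences (toy scale only)."""
    return max(
        product(actions, repeat=k),
        key=lambda plan: horizon_k_utility(model, state, plan, k),
    )
```

This also makes the discontinuity worry concrete: best_plan with horizon k+1 searches a strictly larger space than with horizon k, so the extra step can be exactly the one where a qualitatively different plan (say, one that hacks the humans) first becomes profitable.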
Thanks for your comment. I think I'm a little confused about what it would mean to actually satisfy this assumption.
It seems to me that many current algorithms, for example a Rainbow DQN agent, would satisfy assumption 3? But, like I said, I'm super confused about anything resembling questions about self-awareness/naturalisation.
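(Tangentially, here's why I'd guess that, assuming assumption 3 is roughly "nothing that happens after the agent is stopped matters to it", as your example suggests: in value-based agents like DQN, the one-step TD target masks out the bootstrap term at episode termination, so nothing past that point ever enters the learned values. A toy sketch of the plain one-step target, not the full Rainbow update, which also uses n-step returns and a distributional head:)

```python
def td_target(reward: float, next_q_max: float, done: bool, gamma: float = 0.99) -> float:
    """One-step TD target for a DQN-style agent (toy version).

    When the episode ends (`done` is True), the bootstrap term is masked to
    zero, so no value from beyond that point ever enters the learned Q-function.
    """
    return reward + (0.0 if done else gamma * next_q_max)
```

Whether that is actually the same thing as "satisfying assumption 3" is exactly the part I'm unsure about.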