Again, responding briefly to one point due to my limited time window:
> While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.
Can you say more about this? It doesn’t seem likely to me.
Suppose I am an agent which wants paperclips. The world is full of matter and energy which I can bend to my will in the service of making paperclips. Humans are systems which can be bent towards the task of making paperclips, and I want to manipulate them into doing my bidding not[1] because they might turn me off, but because they are a way to get more paperclips. When I incinerate the biosphere to gain the energy stored inside, it’s not[1] because it’s trying to stop me, but because it is fuel. When my self-replicating factories and spacecraft are impervious to weaponry, it is not[1] because I knew I needed to defend against bombs, but because the best factory/spacecraft designs are naturally robust.
This is a nice point, but it doesn’t seem like such a serious issue for TD-agents. If a TD-agent does try to manipulate humans, it won’t pay costs to do so subtly: manipulating cheaply and unsubtly leads to at least as great expected utility conditional on shutdown at each timestep, and to strictly greater expected utility conditional on shutdown at some timestep. So cheap, unsubtle manipulation will timestep-dominate subtle manipulation, and we can shut down any TD-agents we notice doing cheap and unsubtle manipulation.
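For concreteness, here’s a minimal sketch of the timestep-dominance comparison as I’m using it. The representation (summarizing each lottery by its expected utility conditional on shutdown at each timestep) and all the names are illustrative assumptions of mine, not anything from the TD proposal itself:

```python
# Minimal sketch of the timestep-dominance comparison. The
# list-of-floats representation and every name here are purely
# illustrative. Each lottery is summarized by its expected utility
# conditional on shutdown occurring at timestep t = 0..T.

def timestep_dominates(a, b):
    """True iff lottery `a` timestep-dominates lottery `b`:
    conditional on shutdown, `a` gives at least as great expected
    utility at every timestep and strictly greater at some timestep."""
    assert len(a) == len(b)
    return (all(ua >= ub for ua, ub in zip(a, b))
            and any(ua > ub for ua, ub in zip(a, b)))

# Unsubtle manipulation saves the costs of subtlety, so (by the
# argument above) it does at least as well conditional on shutdown
# at every timestep and strictly better at some:
unsubtle = [1.0, 2.0, 4.0]
subtle = [1.0, 2.0, 3.0]
print(timestep_dominates(unsubtle, subtle))  # True
print(timestep_dominates(subtle, unsubtle))  # False
```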
Another way to put this: subtle manipulation is a form of shutdown-resistance, because (relative to unsubtle manipulation) it involves paying costs to shift probability mass towards longer trajectories.
Are you so sure that unsubtle manipulation is always more effective/cheaper than subtle manipulation? Like, if I’m a human trying to gain control of a company, I think I’m basically just not choosing my strategies based on resisting being killed (“shutdown-resistance”), but I think I probably wind up with something subtle, patient, and manipulative anyway.
[1] (just)