This is a nice point, but it doesn’t seem like such a serious issue for TD-agents. If a TD-agent does try to manipulate humans, it won’t pay costs to do so subtly, because doing so cheaply and unsubtly will lead to at least as great expected utility conditional on shutdown at each timestep and greater expected utility conditional on shutdown at some timestep. So cheap and unsubtle manipulation will timestep-dominate subtle manipulation, and we can shut down any TD-agents we notice doing cheap and unsubtle manipulation.
Another way to put this: subtle manipulation is a form of shutdown-resistance, because (relative to unsubtle manipulation) it involves paying costs to shift probability mass towards longer trajectories.
Are you so sure that unsubtle manipulation is always more effective/cheaper than subtle manipulation? Like, if I’m a human trying to gain control of a company, I think I’m basically just not choosing my strategies based on resisting being killed (“shutdown-resistance”), but I think I probably wind up with something subtle, patient, and manipulative anyway.
This is a nice point, but it doesn’t seem like such a serious issue for TD-agents. If a TD-agent does try to manipulate humans, it won’t pay costs to do so subtly, because doing so cheaply and unsubtly will lead to at least as great expected utility conditional on shutdown at each timestep and greater expected utility conditional on shutdown at some timestep. So cheap and unsubtle manipulation will timestep-dominate subtle manipulation, and we can shut down any TD-agents we notice doing cheap and unsubtle manipulation.
Another way to put this: subtle manipulation is a form of shutdown-resistance, because (relative to unsubtle manipulation) it involves paying costs to shift probability mass towards longer trajectories.
Are you so sure that unsubtle manipulation is always more effective/cheaper than subtle manipulation? Like, if I’m a human trying to gain control of a company, I think I’m basically just not choosing my strategies based on resisting being killed (“shutdown-resistance”), but I think I probably wind up with something subtle, patient, and manipulative anyway.