I’m confused about your first critique. You say the agent has a goal of generating a long-term plan which leads to as much long-term profit as possible; why do you call this a short-term goal, rather than a long-term goal? Do you mean that the agent only takes actions over a short period of time? That’s true in some sense in your example, but I would still characterize this as a long-term goal because success (maximizing profit) is determined by long-term results (which depend on the long-term dynamics of a complex system, etc.).
I see two distinctions between a system like the one I described and a system with long-term goals in the usual sense. First, the goal “write down a plan which, if followed, would lead to long-term profit” is itself a short-term goal which could plausibly be trained up to human-level with a short-term objective function (by training on human-generated predictions). So I think this mechanism avoids the arguments made in claims 4 and 5 of the post for the implausibility of long-term goals (which is my motivation for mentioning it). (I can’t tell if claim 6 was supposed to be addressing long-term goal formation stories like this one.)
Second, the intrinsic goals of the system I described are all short-term (output the text of a plan for a long-term goal; pursue various short-term goals),so the possible alignment failures for such a system might need to be analyzed differently than those of a system with long-term intrinsic goals. For example, such a system might not plan ahead of time to disempower humans (since such disempowerment would come in the long-term, which it doesn’t intrinsically care about). But once it finds that it has enough resources and power to disempower humans, it might then decide to take a catastrophic action, despite not having planned it in advance.
Ah, I think I understand what you meant now. The reward for this agent is not determined by the actual long-term consequences of its action, but by the predicted long-term consequences. In that case, yes, this seems like it might be an interesting middle ground between what we are calling short-term and long-term AIs. Though it still feels closer to a long-term agent to me—I’m confused about why you think it would both (a) not plan ahead of time to disempower humans, and (b) disempower humans when it has the chance. If the predictive model is accurate enough such that it is predictable that disempowering humans would be instrumentally useful, then wouldn’t the model incorporate that into its earlier plans?
Yep, sorry, I’ve probably been pretty unclear with describing this example.
I’m confused about why you think it both (a) not plan ahead of time to disempower humans, and (b) disempower humans when it has the chance. If the predictive model is accurate enough such that it is predictable that disempowering humans would be instrumentally useful, then wouldn’t the model incorporate that into its earlier plans?
For the sake of simplicity, let’s ignore the “execute on arbitrary short-term goals” half of the system I described, and just consider a system which was trained to generate long-term plans that maximize profit by imitating human-generated plans; if we label our training data with information about plan quality, such an agent could plausibly output plans as good as a human expert planner. So the training goal of this system is the short-term goal of minimizing next-token prediction loss during the current episode.
Let’s suppose that this agent is subtly inner misaligned with the (still short-term) inner goal of minimizing the value in the memory address which, during training, represented the current episode’s prediction loss. So if this agent could wirehead by directly setting this number to be very small before the end of its current episode, it would do so. But if it’s not able to do that, it would likely exhibit the behavior we want (outputting plans which mimic its training data), since that’s the best way to get the value in the memory address it cares about to be small. In particular, this system wouldn’t be interested in outputting plans which would manipulate us into setting this value to be small in the future, since it myopically only cares about this memory address’s value during the current episode.
But if at some point the system accumulates lots of resources and power by normal means, it may find that it has the option of e.g. suddenly taking control of the computers that it’s running on and manually setting the value in the memory address it cares about to be small. (I’m assuming here that the system can do things other than just output tokens, e.g. search the internet, contact human experts on the side, etc., so that it could plausibly have a way of taking over its computing cluster without ending the current episode.) So this is a bad action that the system wouldn’t have planned on setting up ahead of time, but would take if it found it was able to.
I’m confused about your first critique. You say the agent has a goal of generating a long-term plan which leads to as much long-term profit as possible; why do you call this a short-term goal, rather than a long-term goal? Do you mean that the agent only takes actions over a short period of time? That’s true in some sense in your example, but I would still characterize this as a long-term goal because success (maximizing profit) is determined by long-term results (which depend on the long-term dynamics of a complex system, etc.).
I see two distinctions between a system like the one I described and a system with long-term goals in the usual sense. First, the goal “write down a plan which, if followed, would lead to long-term profit” is itself a short-term goal which could plausibly be trained up to human-level with a short-term objective function (by training on human-generated predictions). So I think this mechanism avoids the arguments made in claims 4 and 5 of the post for the implausibility of long-term goals (which is my motivation for mentioning it). (I can’t tell if claim 6 was supposed to be addressing long-term goal formation stories like this one.)
Second, the intrinsic goals of the system I described are all short-term (output the text of a plan for a long-term goal; pursue various short-term goals),so the possible alignment failures for such a system might need to be analyzed differently than those of a system with long-term intrinsic goals. For example, such a system might not plan ahead of time to disempower humans (since such disempowerment would come in the long-term, which it doesn’t intrinsically care about). But once it finds that it has enough resources and power to disempower humans, it might then decide to take a catastrophic action, despite not having planned it in advance.
Ah, I think I understand what you meant now. The reward for this agent is not determined by the actual long-term consequences of its action, but by the predicted long-term consequences. In that case, yes, this seems like it might be an interesting middle ground between what we are calling short-term and long-term AIs. Though it still feels closer to a long-term agent to me—I’m confused about why you think it would both (a) not plan ahead of time to disempower humans, and (b) disempower humans when it has the chance. If the predictive model is accurate enough such that it is predictable that disempowering humans would be instrumentally useful, then wouldn’t the model incorporate that into its earlier plans?
Yep, sorry, I’ve probably been pretty unclear with describing this example.
For the sake of simplicity, let’s ignore the “execute on arbitrary short-term goals” half of the system I described, and just consider a system which was trained to generate long-term plans that maximize profit by imitating human-generated plans; if we label our training data with information about plan quality, such an agent could plausibly output plans as good as a human expert planner. So the training goal of this system is the short-term goal of minimizing next-token prediction loss during the current episode.
Let’s suppose that this agent is subtly inner misaligned with the (still short-term) inner goal of minimizing the value in the memory address which, during training, represented the current episode’s prediction loss. So if this agent could wirehead by directly setting this number to be very small before the end of its current episode, it would do so. But if it’s not able to do that, it would likely exhibit the behavior we want (outputting plans which mimic its training data), since that’s the best way to get the value in the memory address it cares about to be small. In particular, this system wouldn’t be interested in outputting plans which would manipulate us into setting this value to be small in the future, since it myopically only cares about this memory address’s value during the current episode.
But if at some point the system accumulates lots of resources and power by normal means, it may find that it has the option of e.g. suddenly taking control of the computers that it’s running on and manually setting the value in the memory address it cares about to be small. (I’m assuming here that the system can do things other than just output tokens, e.g. search the internet, contact human experts on the side, etc., so that it could plausibly have a way of taking over its computing cluster without ending the current episode.) So this is a bad action that the system wouldn’t have planned on setting up ahead of time, but would take if it found it was able to.