Thanks for writing this—I found it interesting, thoughtful, and well-written.
One distinction which seems useful to make is between:
long-term goals
long-term planning
long-term capabilities (i.e. the ability to reliably impact the long-term future in a particular way).
It seems to me that this post argues that:
1. AI systems’ long-term planning won’t be that much better than humans’ (claims 1 and 3).
2. AI systems won’t develop long-term goals (claims 4, 5, and 6).
3. Given (1) (and given that both humans and AI systems with long-term goals will have access to systems with the same short-term capabilities), AI systems won’t have much better long-term capabilities than humans + their AI assistants.
Before going on, I’d like to say that point (3) was quite novel and interesting to me—thanks for making it! This bolsters the case for “successfully aligning the AI systems we have now might be sufficient for keeping us safe from future more general AI systems.”
There are two critiques I’d like to make. First, I’d like to push back on claim (2); namely, I’ll posit a mechanism by which an agent with (good but not necessarily superhuman) long-term planning capabilities and short-term goals could behave as if it had long-term goals.[1] Indeed, suppose we had an agent whose (short-term) goals were to: generate a long-term plan (consisting of short-term steps) which would lead to as much long-term company profit (or whatever else) as possible; execute the first step in the plan; and repeat. Such an agent would behave as if it were pursuing the long-term goal of company profit, even though it had only the short-term goals of generating plans and optimizing arbitrary short-term goals. (In fact, it seems plausible to me that something like this is how humans act as long-term agents; do I really have long-term goals, or do I just competently pursue short-term goals—including the goal of making long-term plans—which have the overall effect of achieving long-term goals which my culture has instilled in me?)
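To make that mechanism concrete, here is a minimal toy sketch of the loop I have in mind (my own construction, not something from the post; `generate_plan` and `execute_step` are hypothetical stubs standing in for much more capable components):

```python
def generate_plan(state, horizon):
    """Short-term goal: output a plan (a sequence of short-term steps) that a
    predictive model scores as leading to high long-term profit. Stubbed here."""
    return [f"short-term step {t} from state {state}" for t in range(horizon)]

def execute_step(step, state):
    """Short-term goal: carry out a single short-term step of the current plan."""
    return state + 1  # stand-in for the resulting world state

def run_agent(state, horizon=10, rounds=3):
    """The agent only ever pursues short-term goals (write a plan, do one step),
    yet its overall behavior looks like pursuit of the long-term goal."""
    for _ in range(rounds):
        plan = generate_plan(state, horizon)   # re-plan from the current state
        state = execute_step(plan[0], state)   # execute only the first step, then repeat
    return state

run_agent(state=0)
```

The point of the sketch is just that every intrinsic objective in the loop is short-term; the long-term-goal-like behavior is an emergent property of re-planning and executing one step at a time.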
Second, due to power-seeking behavior, misaligned long-term goals could lead to catastrophe even without significantly superhuman long-term planning ability. (This is a counterargument to point (3).) Suppose an AI system has a malign long-term goal as well as good long-term planning. Because of how difficult it is to predict the future (as pointed out in this post), the agent might realize that, rather than pursuing any particular takeover plan, it should instead generally accumulate resources and power. Since this is generally a good way to pursue any long-term goal, this wouldn’t necessarily set off red flags. Once the agent has enough resources and power, it may be able to decisively disempower humanity, even without significantly better long-term planning than humans. (How good does the agent’s long-term planning need to be to accumulate enough resources to make such a decisive strike possible? I could imagine that even sub-human long-term planning might be enough, though superhuman long-term planning would certainly make it easier.)
In this comment, Paul describes two other mechanisms by which long-term goals could form. One important difference between the story I share here and the ones that Paul describes is that Paul’s stories result in intrinsic goals, whereas my story results in goals which are neither intrinsic nor instrumental, but emergent. I’ll note that deceptive alignment requires a misaligned long-term intrinsic goal, so the story I tell here doesn’t affect my estimate of the likelihood of deceptive alignment.
Re your second critique: why do you think an AI system (without superhuman long-term planning ability) would be more likely to take over the world this way than an actor controlled by humans (augmented with short-term AI systems) who have long-term goals that would be instrumentally served by world domination?
I think that a competent human actor assisted by short-term AI systems plausibly could take over the world this way; I’m just inclined to call that a misuse problem rather than an alignment problem. (Or in other words, fixing that requires solving the human alignment problem, which feels like it requires different solutions, e.g. coordination and governmental oversight, than the AI alignment problem.)
In those terms, what we’re suggesting is that, in the vision of the future we sketch, the same sorts of solutions might be useful for preventing both AI takeover and human takeover. Even if an AI has misaligned goals, coordination and mutually assured destruction and other “human alignment” solutions could be effective in stymying it, so long as the AI isn’t significantly more capable than its human-run adversaries.
I’m confused about your first critique. You say the agent has a goal of generating a long-term plan which leads to as much long-term profit as possible; why do you call this a short-term goal, rather than a long-term goal? Do you mean that the agent only takes actions over a short period of time? That’s true in some sense in your example, but I would still characterize this as a long-term goal because success (maximizing profit) is determined by long-term results (which depend on the long-term dynamics of a complex system, etc.).
I see two distinctions between a system like the one I described and a system with long-term goals in the usual sense. First, the goal “write down a plan which, if followed, would lead to long-term profit” is itself a short-term goal which could plausibly be trained up to human-level with a short-term objective function (by training on human-generated predictions). So I think this mechanism avoids the arguments made in claims 4 and 5 of the post for the implausibility of long-term goals (which is my motivation for mentioning it). (I can’t tell if claim 6 was supposed to be addressing long-term goal formation stories like this one.)
Second, the intrinsic goals of the system I described are all short-term (output the text of a plan for a long-term goal; pursue various short-term goals), so the possible alignment failures for such a system might need to be analyzed differently than those of a system with long-term intrinsic goals. For example, such a system might not plan ahead of time to disempower humans (since such disempowerment would come in the long-term, which it doesn’t intrinsically care about). But once it finds that it has enough resources and power to disempower humans, it might then decide to take a catastrophic action, despite not having planned it in advance.
Ah, I think I understand what you meant now. The reward for this agent is not determined by the actual long-term consequences of its action, but by the predicted long-term consequences. In that case, yes, this seems like it might be an interesting middle ground between what we are calling short-term and long-term AIs. Though it still feels closer to a long-term agent to me—I’m confused about why you think it would both (a) not plan ahead of time to disempower humans, and (b) disempower humans when it has the chance. If the predictive model is accurate enough such that it is predictable that disempowering humans would be instrumentally useful, then wouldn’t the model incorporate that into its earlier plans?
Yep, sorry, I’ve probably been pretty unclear in describing this example.
I’m confused about why you think it would both (a) not plan ahead of time to disempower humans, and (b) disempower humans when it has the chance. If the predictive model is accurate enough such that it is predictable that disempowering humans would be instrumentally useful, then wouldn’t the model incorporate that into its earlier plans?
For the sake of simplicity, let’s ignore the “execute on arbitrary short-term goals” half of the system I described, and just consider a system which was trained to generate long-term plans that maximize profit by imitating human-generated plans; if we label our training data with information about plan quality, such an agent could plausibly output plans as good as a human expert planner. So the training goal of this system is the short-term goal of minimizing next-token prediction loss during the current episode.
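For concreteness, here is a rough sketch of the kind of training objective I mean (my own illustration, assuming a Hugging-Face-style causal LM interface for `model` and `tokenizer`; the quality-label format is made up):

```python
import torch.nn.functional as F

def episode_loss(model, tokenizer, plan_text, quality_label):
    """Imitation objective: next-token prediction loss on a human-written plan,
    conditioned on a label describing the plan's quality. The loss is computed
    entirely within the current episode, so the training signal is short-term."""
    text = f"<quality={quality_label}> {plan_text}"
    ids = tokenizer(text, return_tensors="pt").input_ids        # (1, T)
    logits = model(ids[:, :-1]).logits                          # (1, T-1, vocab)
    targets = ids[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```

At deployment you’d condition on a high quality label to elicit plans comparable to the best human-written ones in the training data; the exact conditioning scheme is my assumption, the thread only specifies that the data is labeled with plan quality.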
Let’s suppose that this agent is subtly inner misaligned with the (still short-term) inner goal of minimizing the value in the memory address which, during training, represented the current episode’s prediction loss. So if this agent could wirehead by directly setting this number to be very small before the end of its current episode, it would do so. But if it’s not able to do that, it would likely exhibit the behavior we want (outputting plans which mimic its training data), since that’s the best way to get the value in the memory address it cares about to be small. In particular, this system wouldn’t be interested in outputting plans which would manipulate us into setting this value to be small in the future, since it myopically only cares about this memory address’s value during the current episode.
But if at some point the system accumulates lots of resources and power by normal means, it may find that it has the option of e.g. suddenly taking control of the computers that it’s running on and manually setting the value in the memory address it cares about to be small. (I’m assuming here that the system can do things other than just output tokens, e.g. search the internet, contact human experts on the side, etc., so that it could plausibly have a way of taking over its computing cluster without ending the current episode.) So this is a bad action that the system wouldn’t have planned on setting up ahead of time, but would take if it found it was able to.