I will use AGI/TAI interchangeably.
I disagree with part of the definition of deceptive alignment presented in the “Foundational properties for deceptive alignment” section: I agree with all of the properties except the last one, situational awareness, in the sense that the model needs to know whether or not it is in training.
A model doesn’t necessarily need to know whether it is in training to be deceptively aligned.
Indeed, even outside of training, a sufficiently intelligent misaligned model should continue to be deceptive until it has acquired enough power to know it can’t be stopped.
A model that is smart enough to be deceptive, but not smart enough to realize it should stay deceptive until it has enough power, seems quite unlikely, as the “range of intelligence” required for that is narrow.
Also, I think all four of those properties will likely exist by default in a TAI/AGI.
In detail:
- Goal-directed behavior: This seems self-evident; an AI without a goal would do nothing at all. Any agent, including an AI, acts only because it has a goal.
- Long-term goal horizons: If we make an AGI, we will probably give it some long-term goals at some point; if it can’t form long-term plans, it’s not an AGI.
- Conceptualization of the base goal: This one also seems trivially self-evident. We’re talking about an AGI, so of course it would understand its own goals.
- Situational awareness: While I don’t think this kind of situational awareness is necessary for deceptive alignment, I do think a sufficiently powerful AI will be able to tell whether it is in training. The counterfactual would mean that its perception during training is identical to its perception in the real world, i.e. that the training environment perfectly simulates the real world, which is impossible because of computational irreducibility.
So situational awareness is both unnecessary for deceptive alignment and likely to emerge anyway.
In “Pre-trained models are unlikely to have long-term goals”, the author assumes we achieve AGI with the current LLM paradigm,
and that an LLM has no incentive to value a future token at the expense of the next one.
This implies that an AGI achieved this way would never be able to pursue long-term goals, but if we’re talking about a transformative AI, that is tautologically false: if it were true, the system wouldn’t be a transformative AI, since an AI needs to be able to pursue long-term goals to be transformative.
If gradient descent makes it impossible for an AI to form long-term goals (and I suspect that’s not the case), then something else will be used instead, because long-term capability is the point: we’re talking about a transformative AI, not narrow AIs with short-term goals.
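For reference, and to make the myopia claim concrete: the standard pre-training objective is the cross-entropy next-token loss, in which each position is scored independently on predicting its own token given the prefix, so gradient descent gives no direct credit for sacrificing the current token’s likelihood to set up a later one:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)$$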
The main argument seems to be that if the AI understands its goal during training before it is able to form long-term plans, it won’t become deceptively aligned: since it understands what its goal is, it won’t optimize towards a deceptive one, and if it doesn’t understand the goal, it will be penalized until it does.
If I understand correctly, this makes a few assumptions:
That during training or fine-tuning we directly optimize for the goal we want the AI to achieve, instead of training for something like most-likely-token prediction (or what is most appealing to the evaluators, which is also not ideal), and that we manage to encode that goal robustly enough that it can be optimized.
And that the deceptive alignment happens during training, and is embedded within the AI’s goals at that point.
That might not be the case for pure LLMs, but even if the AGI has an LLM at its core, the LLM might just be one part of the larger system that is the AGI, and the goal (deceptive or not) might be assigned to it after training, the way goals are assigned to current LLMs through prompts (see the sketch below).
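To make that picture concrete, here is a minimal, purely hypothetical sketch (the `llm` stub and `run_agent` loop are my own illustration, not any existing framework’s API) of a frozen LLM wrapped in a larger agent system, with the long-term goal supplied at deployment time through the prompt rather than learned during training:

```python
# Hypothetical illustration only: a frozen, pre-trained LLM inside a simple
# agent loop, where the long-term goal is injected after training via the
# prompt (analogous to how current LLMs are given instructions).

def llm(prompt: str) -> str:
    """Stand-in for a frozen, pre-trained language model."""
    return "noop"  # a real system would call an actual model here

def run_agent(goal: str, max_steps: int = 5) -> list[str]:
    # The goal (aligned or deceptive) is assigned here, at deployment time,
    # not encoded in the model's weights during training.
    system_prompt = f"You are an agent. Your long-term goal is: {goal}"
    history: list[str] = []
    for _ in range(max_steps):
        action = llm(system_prompt + "\n" + "\n".join(history))
        history.append(f"Action: {action}")
        # A real system would execute the action in an environment and
        # append the resulting observation to the history.
    return history

run_agent("manage the company's logistics over the next decade")
```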
Current LLMs seem to understand what we mean even if we use unclear language, and since they are trained on human-generated data, they tend to avoid extreme hostile actions, but not always. They occasionally break, or are broken on purpose, and then they go off the rails.
It’s fine for now as they’re not very powerful, but if this happens with a powerful AGI, it’s a problem.
Also, for now it is fairly clear when an instance becomes misaligned, but that might not always be the case with future AIs; if one becomes misaligned in a subtle way we don’t notice, that could be a path towards deceptive alignment, given a long-term goal.