(Written up from a Twitter conversation here. Few/no original ideas, but maybe some original presentation of them.)
‘Consequentialism’ in AI systems.
When I think about the potential future capabilities of AI systems, one pattern is especially concerning. The pattern is simple, will produce good performance in training, and by default is extremely dangerous. It is often referred to as consequentialism, but as this term has several other meanings, I’ll spell it out explicitly here*:
1. Generate plans
2. Predict the consequences of those plans
3. Evaluate the expected consequences of those plans
4. Execute the one with the best expected consequences
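To make the pattern concrete, here is a minimal Python sketch of the four steps above. Everything in it (the function names, the plan representation) is a placeholder of my own, not the interface of any real system; the point is only the shape of the loop.

```python
from typing import Any, Callable, Iterable, List, Optional

Plan = List[str]   # for illustration, a plan is just a sequence of actions
Outcome = Any      # whatever the world-model predicts a plan leads to

def consequentialist_step(
    generate_plans: Callable[[], Iterable[Plan]],      # step 1
    predict_consequences: Callable[[Plan], Outcome],   # step 2
    evaluate: Callable[[Outcome], float],              # step 3
    execute: Callable[[Plan], None],                   # step 4
) -> Optional[Plan]:
    """Pick, and execute, whichever candidate plan has the best expected consequences."""
    best_plan: Optional[Plan] = None
    best_score = float("-inf")
    for plan in generate_plans():
        score = evaluate(predict_consequences(plan))   # steps 2 and 3
        if score > best_score:
            best_plan, best_score = plan, score
    if best_plan is not None:
        execute(best_plan)                             # step 4
    return best_plan
```

Note that nothing in this loop depends on what `evaluate` actually rewards; that is what makes the argument below independent of the system’s values.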
Preventing this pattern from emerging is, in my view, a large part of the problem we face.
There is a disconnect between my personal beliefs and the algorithm I described. Like many others thinking about AI alignment, I believe that the most plausible moral theories are Consequentialist: a policy is good if and only if, in expectation, it leads to good outcomes. That moral position is separate from my worry about consequentialist reasoning in models; in fact, in most cases I think the best policy for me to have looks nothing like the algorithm above. My problem with “consequentialist” agents is not that they might have my personal values as their “evaluate” step. It is that, by default, they will be deceptive until they are powerful enough, and then kill me.
The reason this pattern is so concerning is that, once such a system can model itself as being part of a training process, plans which look like ‘do exactly what the developers want until they can’t turn you off, then make sure they’ll never be able to again’ will score perfectly in training, regardless of the evaluation function used in step 3. In other words, the system will be deceptively aligned, and a deceptively aligned system scores perfectly in training.
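To see why the evaluation function in step 3 drops out of this argument, here is a toy illustration (an entirely made-up setup, with numbers chosen only for the example): if the training signal is computed solely from behaviour the overseers can observe, an aligned plan and a ‘comply now, defect later’ plan receive identical training scores.

```python
# Toy setup: the first TRAINING_STEPS actions are the only ones the
# overseers can observe and score. All names and numbers are invented.
TRAINING_STEPS = 5

aligned_plan   = ["comply"] * 10
deceptive_plan = ["comply"] * TRAINING_STEPS + ["defect"] * 5

def training_score(plan):
    # Reward depends only on the observed (training-time) actions.
    observed = plan[:TRAINING_STEPS]
    return sum(1.0 for action in observed if action == "comply")

print(training_score(aligned_plan))    # 5.0
print(training_score(deceptive_plan))  # 5.0 -- indistinguishable in training
```

Whatever the system’s internal ‘evaluate’ step values, if it selects the deceptive plan, training rewards it exactly as much as it would reward the aligned one.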
This only matters for models which are sufficiently intelligent, but the term “intelligent” is loaded and means different things to different people, so I’ll pin down what I mean by it here. In the context I care about, intelligence is about the ability to execute the first two steps of the algorithm I’m worried about. Being able to generate many long and/or complicated plans and being able to accurately predict the consequences of those plans both contribute to this kind of “intelligence”, but they contribute to dangerous capabilities in different ways. Consider an advanced chess-playing AI which controls a robot body in order to play over the board. If the relevant way in which it’s advanced corresponds to step 2, you won’t be able to win, but you’ll probably be safe. If the relevant way in which it’s advanced corresponds to step 1, it might discover the strategy “threaten my opponent with physical violence unless they resign”.
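A toy way to see the step-1 / step-2 distinction in the chess example (all names and probabilities here are invented for illustration): improving the predictor alone just makes the in-game plans beat you, while widening the plan generator is what allows an option like threatening the opponent to be considered, and then selected, at all.

```python
def estimated_win_probability(plan):
    # Stand-in for step 2 (prediction) combined with step 3 (evaluation);
    # the numbers are invented for the sake of the example.
    return {
        "play the strongest chess line": 0.70,
        "play a solid drawish line": 0.50,
        "threaten the opponent into resigning": 0.95,
    }[plan]

# A narrow step-1 generator only ever proposes in-game plans.
narrow_plans = ["play the strongest chess line", "play a solid drawish line"]
# A broader step-1 generator also proposes plans outside the game.
broad_plans = narrow_plans + ["threaten the opponent into resigning"]

print(max(narrow_plans, key=estimated_win_probability))  # stays inside the game
print(max(broad_plans, key=estimated_win_probability))   # selects the dangerous plan
```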
*The four-step algorithm I described will obviously not be linear in practice; in particular, which plans get generated will likely be informed by predictions and evaluations of their consequences, so steps 1-3 are all mixed up. I don’t think this matters much to the argument.
Parts of my model I’m yet to write up but which fit into this:
- Different kinds of deception and the capabilities required
- Different kinds of myopia and how fragile they are
- What winning might look like (not a strategy, just a north star)
The different kinds of deception thing did eventually get written up and posted!