Consequentialism is not uniformly bad. I think the specific way Shane wants to make the models more consequentialist defends against some failure modes. Deep Deceptiveness is essentially about humans being misled because the module that checks for safety of an action is shallow, and the rest of the model is smarter than it and locally pointed at goals that are more difficult to satisfy without misleading the operators. If the model in this story were fully aware of the consequences of its actions through deliberation, it could realize that modeling the human operators in a different ontology in order to route around them is still bad. (I feel like self-knowledge is more important here though.)
Deliberation also does not have to be consequentialist. The model could deliberate to ensure it’s not breaking some deontological rule, and this won’t produce instrumental pressure towards a coup.
I'd be curious to hear your idea of some of the "difficulty and traps".
A dialogue seems like a good format for explaining this, given the probable inferential gap between my model and yours. I'd be happy to have one if you like.
We started a dialogue, which will live here when we post it.