I like this thread; I think it represents an important piece of the puzzle, and I’m hoping to write something more detailed on it soon, but here’s a brief take for now.
My take is roughly: search/planning is one important ingredient of ‘consequentialism’ (in fact it is perhaps definitional, the way I understand consequentialism). When you face a consequentialist adversary (with strategic awareness[1]), you should, all else equal, expect it to be more resilient to your attempts to put it out of action. Why? An otherwise similarly-behaved-in-training system which isn’t a consequentialist must instead have learned some heuristics during training. Those heuristics will sometimes be lucky and rest on abstractions which generalise some way beyond the training distribution. But the further you move from the training distribution, the less likely it is (eventually vanishingly unlikely) that those abstractions and heuristics remain suitable, so you should be more optimistic about being able to take such a system down if you need to. In contrast, a consequentialist system can refine or even replace its heuristics in response to changes in its inputs (by doing search/planning).
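To make that contrast concrete, here is a toy sketch (entirely my own illustration, not something from the post I’m replying to; the gridworld and all names are made up). A policy distilled into a fixed heuristic keeps firing the same rule when the world shifts, while a planner that searches over its current model of the world re-derives a workable route:

```python
from collections import deque

def plan(start, goal, neighbours):
    """Breadth-first search over whatever transition model we currently have."""
    frontier, parents = deque([start]), {start: None}
    while frontier:
        node = frontier.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in neighbours(node):
            if nxt not in parents:
                parents[nxt] = node
                frontier.append(nxt)
    return None  # no route found in this model

def grid_neighbours(blocked):
    """Neighbours on a 3x3 grid, minus any blocked cells."""
    def nbrs(cell):
        x, y = cell
        steps = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        return [(a, b) for a, b in steps
                if 0 <= a <= 2 and 0 <= b <= 2 and (a, b) not in blocked]
    return nbrs

# A heuristic distilled from the training world: "walk straight along the
# bottom row". It works in training, where nothing is in the way.
heuristic_policy = lambda cell: (cell[0] + 1, cell[1])

train_world = grid_neighbours(blocked=set())
shifted_world = grid_neighbours(blocked={(1, 0)})  # bottom-row cell now blocked

print(plan((0, 0), (2, 0), train_world))    # [(0, 0), (1, 0), (2, 0)]
print(plan((0, 0), (2, 0), shifted_world))  # detours via the middle row
print(heuristic_policy((0, 0)))             # (1, 0): steps into the blocked cell
```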
Another (perhaps independent?) ingredient is the ability to refine and augment the abstractions and world/strategic model on which the planning rests (play/experimentation). I would be even more pessimistic about a playful consequentialist adversary, because I’d expect its consequentialism to keep working even further (perhaps indefinitely far) outside the training distribution, given the opportunity to experiment.
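Continuing the same toy (again purely illustrative, and reusing plan, grid_neighbours, and shifted_world from the sketch above): a ‘playful’ planner that starts with a stale model of the world, but revises that model whenever an attempted step fails and then re-plans, still reaches the goal:

```python
def plan_with_experiments(start, goal, assumed_blocked, true_neighbours, max_tries=20):
    blocked = set(assumed_blocked)   # the agent's (possibly wrong) world model
    cell = start
    for _ in range(max_tries):
        if cell == goal:
            return cell, blocked
        path = plan(cell, goal, grid_neighbours(blocked))
        if path is None:
            return None              # no route even under the revised model
        nxt = path[1]
        if nxt in true_neighbours(cell):
            cell = nxt               # the experiment succeeded: take the step
        else:
            blocked.add(nxt)         # it failed: revise the model and re-plan
    return None

# The agent believes nothing is blocked; the real world has (1, 0) blocked.
print(plan_with_experiments((0, 0), (2, 0), set(), shifted_world))
# -> ((2, 0), {(1, 0)}): it reaches the goal and has learned where the block is.
```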
[1] Roughly, I mean ‘knows about humans and ways to interact with and influence them’: https://www.alignmentforum.org/posts/HduCjmXTBD4xYTegv/draft-report-on-existential-risk-from-power-seeking-ai (and see some discussion here: https://www.alignmentforum.org/posts/cCMihiwtZx7kdcKgt/comments-on-carlsmith-s-is-power-seeking-ai-an-existential)