(This is the second time someone has asked this, so the fault is probably with the post and I should edit it somehow.)
The difference is that the maze AI is running a search. (The classifier isn’t; it’s just applying a bunch of rules.) This matters because that’s where the whole thing gets dangerous. If you followed the last part of the post, on deceptive and proxy alignment: those concepts only make sense once we’re in the business of optimizing, i.e., running a search for actions that score well according to some utility function. In that setting, it makes sense to think of the inner thing as an “optimizer” or “agent” that has goals/wants things/etc.
What’s the conceptual difference between “running a search” and “applying a bunch of rules”? Whatever rules the cat AI is applying to the image must be implemented by some step-by-step algorithm, and it seems to me like that could probably be represented as running a search over some space. Similarly, you could abstract away the step-by-step understanding of how breadth-first search works and say that the maze AI is applying the rule of “return the shortest path to the red door”.
Yeah, very good question. The honest answer is that I don’t know; I had this distinction in mind when I wrote the post, but now that I’m pressed on it, I don’t know if there’s a simple way to capture it. Someone on the AstralCodexTen article just asked the same, and the best I came up with is “the set of possible outputs is very large and contains harmful elements”. This would certainly be a necessary criterion; if every output is harmless, the system can’t be dangerous. (GPT already fails this.)
But even if there is no qualitative step, you can view it as a spectrum of competence, and deceptive/proxy alignment start being a possibility at some point on the spectrum. Not having the crisp characterization doesn’t make the dangerous behavior go away.
I like this thread; I think it represents an important piece of the puzzle, and I’m hoping to write something more detailed on it soon, but here’s a brief version.
My take is roughly: search/planning is one important ingredient of ‘consequentialism’ (in fact it is perhaps definitional, the way I understand consequentialism). When you have a consequentialist adversary (with strategic awareness[1]), you should (all other things equal) expect it to be more resilient to your attempts to put it out of action. Why? A system that behaves similarly in training but isn’t a consequentialist must have learned some heuristics during training. Those heuristics will sometimes be lucky and rest on abstractions which generalise some amount outside of the training distribution. But the further you get from the training distribution, the less likely it is that the abstractions and heuristics remain suitable. So you should be more optimistic about taking such a system down if you need to. In contrast, a consequentialist system can refine or even replace its heuristics in response to changes in inputs (by doing search/planning).
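To make the contrast concrete, here is a toy sketch in Python. Everything in it is a hypothetical illustration rather than anything from the post: the grid mazes, the names `bfs_path`, `heuristic_policy`, `search_policy` and `is_valid`, and above all the choice to model a "learned heuristic" as a memorised path. On the training maze the two policies are indistinguishable from the outside (which is the point raised upthread about viewing search as "just a rule"), but once a wall moves, only the policy that re-runs the search still reaches the door.

```python
from collections import deque

def bfs_path(maze, start):
    """Breadth-first search: shortest path from start to the door 'D', or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        r, c = cell
        if maze[r][c] == "D":
            path = []
            while cell is not None:  # walk parent links back to the start
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < len(maze) and 0 <= nc < len(maze[nr])
            if inside and maze[nr][nc] != "#" and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

# Hypothetical mazes: '#' is a wall, '.' is open, 'D' is the door.
TRAIN_MAZE = ["..#",
              ".##",
              "..D"]
SHIFTED_MAZE = ["...",
                "#.#",
                "..D"]

# Assumption: stand in for "heuristics learned in training" with a memorised answer.
MEMORISED = {(0, 0): bfs_path(TRAIN_MAZE, (0, 0))}

def heuristic_policy(maze, start):
    """Replays what worked in training; never looks at the current maze."""
    return MEMORISED.get(start)

def search_policy(maze, start):
    """Re-plans from scratch on whatever maze it is given."""
    return bfs_path(maze, start)

def is_valid(maze, path):
    """True if path stays on open cells, moves one step at a time, and ends at 'D'."""
    if not path:
        return False
    for r, c in path:
        inside = 0 <= r < len(maze) and 0 <= c < len(maze[r])
        if not inside or maze[r][c] == "#":
            return False
    steps_ok = all(abs(r1 - r2) + abs(c1 - c2) == 1
                   for (r1, c1), (r2, c2) in zip(path, path[1:]))
    last_r, last_c = path[-1]
    return steps_ok and maze[last_r][last_c] == "D"

for name, maze in (("train", TRAIN_MAZE), ("shifted", SHIFTED_MAZE)):
    for policy in (heuristic_policy, search_policy):
        print(name, policy.__name__, is_valid(maze, policy(maze, (0, 0))))
```

This only illustrates the re-planning half of the argument. A real learned heuristic would generalise somewhat better than a lookup table, and nothing here touches strategic awareness or goals.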
Another (perhaps independent?) ingredient is the ability to refine and augment the abstractions and world/strategic model on which the planning rests (play/experimentation). I would be even more pessimistic about a playful consequentialist adversary, because I’d expect its consequentialism to keep working even further (perhaps indefinitely far) outside the training distribution, given the opportunity to experiment.
[1] Roughly, I mean ‘knows about humans and ways to interact with and influence them’; see https://www.alignmentforum.org/posts/HduCjmXTBD4xYTegv/draft-report-on-existential-risk-from-power-seeking-ai (and some discussion here: https://www.alignmentforum.org/posts/cCMihiwtZx7kdcKgt/comments-on-carlsmith-s-is-power-seeking-ai-an-existential)