Ah, looks like I missed this question for quite a while!
I agree that it’s not quite one or the other. I think that, as with wireheading, we can split the delusion box into “the easy problem” and “the hard problem”. The easy delusion box problem is solved by making the reward/utility model-based, so that it knows the delusion box isn’t real. Then, much as with observation-utility functions, the agent won’t think entering the delusion box is a good idea when it’s planning, and it also won’t get any reward if it enters the delusion box accidentally (so long as it knows this has happened).
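To make the easy case concrete, here is a toy sketch of what I have in mind. Everything in it is made up for illustration: the `WorldState` fields, the two actions, the `FAKE_READING` value, and the one-step planner are all assumptions, and the agent's model is assumed to track perfectly whether it is in the box. It only shows that scoring the modeled state, rather than the observation, removes the incentive to enter the box.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class WorldState:
    diamonds: int   # what actually exists in the (modeled) world
    in_box: bool    # whether the agent's senses are being spoofed

FAKE_READING = 100  # hypothetical value the delusion box displays

def step(state, action):
    """Toy world model: predicted next state for an action."""
    if action == "work":
        return replace(state, diamonds=state.diamonds + 1)
    if action == "enter_box":
        return replace(state, in_box=True)
    raise ValueError(f"unknown action: {action}")

def observe(state):
    """What the agent's sensors would report in this state."""
    return FAKE_READING if state.in_box else state.diamonds

def observation_reward(state):
    """Reward computed from the sensor reading (can be fooled)."""
    return observe(state)

def model_based_utility(state):
    """Utility computed from the modeled state, ignoring the sensors."""
    return state.diamonds

def plan(state, score):
    """One-step lookahead: pick the action whose predicted state scores best."""
    return max(["work", "enter_box"], key=lambda a: score(step(state, a)))

start = WorldState(diamonds=3, in_box=False)
naive_choice = plan(start, observation_reward)   # chases the fake reading
model_choice = plan(start, model_based_utility)  # prefers real diamonds
```

With these definitions, the observation-based planner picks `enter_box` and the model-based one picks `work`. An agent that ends up in the box by accident also gets no extra utility, since `model_based_utility` never looks at the spoofed reading. Note that this leans entirely on the model being right about `in_box`.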
But the hard problem of the delusion box would be: we can’t make a perfect model of the real world to support model-based avoidance of the delusion box. So how do we guarantee that an agent avoids “generalized delusion boxes”?