As far as I can tell, Paul’s current proposal might still suffer from blackmail, like his earlier proposal which I commented on.
What problem do you have in mind here?
I thought that the previous problem was mostly psychological, i.e. that if humans were rational agents then this AI would be roughly as vulnerable to blackmail as its designers. So I thought the issue was the psychological strangeness (and great length) of the weird hypothetical.
Here we have no such hypothetical, and the system’s behavior only depends on the predicted behavior of humans in the real world. That seems to address the narrow version of your concern.
I can see two analogous problems:
The learning system may believe itself to be in a simulation, as discussed here. But that seems to cash out to a straightforward technical problem: we want to ensure that whenever there is some important scenario where the AI would behave strangely, we provide feedback on one such scenario. And this technical problem (1) looks like it should be soluble in principle, (2) looks pretty tractable right now, and (3) needs to be solved anyway to resolve a whole host of other problems (a bunch of exotic failures, but also a bunch of more prosaic difficulties).
The behavior of the system depends on what a human would do across many cases. We can sample from these cases at random, but maybe a small fraction of them are sufficiently surprising that they would lead a human to believe they are in a simulation. We would like to sample these situations with disproportionate probability as well, so that the human won’t panic when they find themselves in one, but I haven’t talked in any detail about how to do that and it’s not obvious whether it is possible. (I do think it’s possible.)
Did you have in mind 1, 2, or something else?
I mostly had in mind 2. Not sure how predicting humans is different from putting humans in hypotheticals. It seems like the same problems could happen.
I agree that the same problem appears for ALBA. I was originally working with proposals where the improbability of the human’s situation was bounded, but the recursive structure can lead to arbitrarily large improbability. I hadn’t thought about this explicitly.
Predicting humans is different from putting humans in hypotheticals, in the sense that in principle you can actually sample from the situations that cause humans to think they are in a simulation or whatever.
For example, suppose the human had access to a button that said “This is weird, I’m probably in a simulation,” and suppose that we expected the human to press it in any case where they would start behaving weirdly. Then we could potentially sample from the subset of situations where the human presses the button. And if we manage to do that, then the human isn’t right to suspect they are in a simulation (any more than they already should believe they are in a simulation, prior to even building the AI).
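A minimal sketch of the kind of conditional sampling being gestured at here, in Python. Everything in it is hypothetical and not part of the original proposal: sample_situation, would_press_weird_button (a stand-in for a model of when the human would press the button), and query_human are assumed primitives, and rejection sampling is just the simplest possible way to draw from the flagged subset.

```python
import random

# Hypothetical stand-ins -- none of these come from the proposal itself:
#   sample_situation()            draws a situation from the distribution the
#                                 predictor is expected to face
#   would_press_weird_button(s)   a model of whether the human would flag the
#                                 situation as "this is weird, I'm probably in
#                                 a simulation"
#   query_human(s)                shows the situation to a human and records
#                                 their behavior

def sample_weird_situation(sample_situation, would_press_weird_button,
                           max_tries=10_000):
    """Rejection-sample a situation the human would flag as weird.

    This is the naive way to 'sample from the subset of situations where the
    human presses the button'; if such situations are very rare, a smarter
    conditional sampler would be needed.
    """
    for _ in range(max_tries):
        s = sample_situation()
        if would_press_weird_button(s):
            return s
    return None  # gave up; the caller falls back to the base distribution


def collect_training_queries(n, sample_situation, would_press_weird_button,
                             query_human, weird_fraction=0.1):
    """Mix ordinary situations with deliberately oversampled 'weird' ones.

    Because weird situations are guaranteed roughly a weird_fraction share of
    genuine training queries, a human who finds themselves in one has little
    extra reason to conclude they are in a simulation.
    """
    queries = []
    for _ in range(n):
        if random.random() < weird_fraction:
            s = (sample_weird_situation(sample_situation,
                                        would_press_weird_button)
                 or sample_situation())
        else:
            s = sample_situation()
        queries.append((s, query_human(s)))
    return queries
```

The load-bearing part is the mixing step, not the rejection sampler: as long as flagged situations reliably show up among real queries, the inference “this situation is too strange to be real” stops being valid.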
I agree that you shouldn’t expect the current version of the scheme to cope with this problem. And it’s certainly not obvious that the approach sketched above can work; there are many important questions. But I do think that there is a clear path to fixing this kind of problem.
(Interestingly, this is exactly analogous to what the active learning scheme has to do, in order to get the prediction algorithms to predict well.)
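For comparison, a generic uncertainty-based active learning loop has the same structure: instead of sampling queries i.i.d., it deliberately routes the human’s attention to the inputs where the predictor is least reliable. Again a toy sketch with hypothetical names (predictor.uncertainty, predictor.update, query_human), not a description of any particular scheme:

```python
def active_learning_round(predictor, candidate_pool, query_human, budget):
    """Ask the human about the candidates the predictor is least sure about.

    The parallel to the discussion above: rare-but-important inputs are
    sampled on purpose rather than waiting for them to show up at random.
    """
    ranked = sorted(candidate_pool, key=predictor.uncertainty, reverse=True)
    labelled = [(s, query_human(s)) for s in ranked[:budget]]
    predictor.update(labelled)
    return labelled
```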
ETA: although I hadn’t thought about this explicitly in the context of ALBA, I have expected to need some way to overweight “weird” situations in order to stop them from being problematic, ever since here.