As far as I can tell, Paul’s current proposal might still suffer from blackmail, like his earlier proposal which I commented on. I vaguely remember discussing the problem with you as well.
One big lesson for me is that AI research seems to be more incremental and predictable than we thought, and garage FOOM probably isn’t the main danger. It might be helpful to study the strengths and weaknesses of modern neural networks and get a feel for their generalization performance. Then we could try to predict which areas will see big gains from neural networks in the next few years, and which parts of Friendliness become easy or hard as a result. Is anyone at MIRI working on that?
If they did that, then what? Try to convince NN researchers to attack the parts of Friendliness that look hard? That seems difficult for MIRI to do, given where they’ve invested in building their reputation (i.e., among decision theorists and mathematicians rather than in the ML community). (It would really depend on people trusting their experience and judgment, since it’s hard to see how much one could offer in the form of either mathematical proof or clearly relevant empirical evidence.) You’d have a better chance if the work were carried out by some other organization. But even if that organization got NN researchers to take its results seriously, what incentive would they have to attack the parts of Friendliness that seem especially hard, instead of doing what they’ve been doing, i.e., racing as fast as they can for the next milestone in capability?
Or is the idea to bet on the off chance that building an FAI with NN turns out to be easy enough that MIRI and like-minded researchers can solve the associated Friendliness problems themselves and then hand the solutions to whoever ends up leading the AGI race, and they can just plug the solutions in at little cost to their winning the race?
Or you’re suggesting aiming/hoping for some feasible combination of both, I guess. It seems pretty similar to what Paul Christiano is doing, except he has “generic AI technology” in place of “NN” above. To me, the chance of success of this approach seems low enough that it’s not obviously superior to what MIRI is doing (namely, in my view, betting on the off chance that the contrarian AI approach they’re taking ends up being much easier/better than the mainstream approach, which is looking increasingly unlikely but still not impossible).
One big lesson for me is that AI research seems to be more incremental and predictable than we thought, and garage FOOM probably isn’t the main danger.
That may be true, but that is hindsight bias. MIRI’s (or EY’s, for that matter) approach of hedging against that being true was nonetheless a very reasonable one (and maybe, given the knowledge at the time, the only reasonable one).
As far as I can tell, Paul’s current proposal might still suffer from blackmail, like his earlier proposal which I commented on
What problem do you have in mind here?
I thought that the previous problem was mostly psychological, i.e. that if humans were rational agents then this AI would be roughly as vulnerable to blackmail as its designers. So I thought the issue was the psychological strangeness (and great length) of the weird hypothetical.
Here we have no such hypothetical, and the system’s behavior only depends on the predicted behavior of humans in the real world. That seems to address the narrow version of your concern.
I can see two analogous problems:
1. The learning system may believe itself to be in a simulation, as discussed here. But that seems to cash out to a straightforward technical problem: we want to ensure that, whenever there is some important scenario in which the AI would behave strangely, we provide feedback on at least one such scenario. And this technical problem (1) looks like it should be soluble in principle, (2) looks pretty tractable right now, and (3) is needed to resolve a whole host of other problems (a bunch of exotic failures, but also a bunch of more prosaic difficulties).
2. The behavior of the system depends on what a human would do across many cases. We can sample from these cases at random, but maybe a small fraction of them are sufficiently surprising that they would lead a human to believe they are in a simulation. We would like to sample these situations with disproportionate probability as well, so that the human won’t panic when they find themselves in one, but I haven’t talked in any detail about how to do that and it’s not obvious whether it is possible. (I do think it’s possible.)
Did you have in mind 1, 2, or something else?
I mostly had in mind 2. Not sure how predicting humans is different from putting humans in hypotheticals. It seems like the same problems could happen.
I agree that the same problem appears for ALBA. I was originally working with proposals where the improbability of the human’s situation was bounded, but the recursive structure can lead to arbitrarily large improbability. I hadn’t thought about this explicitly.
Predicting humans is different from putting humans in hypotheticals, in the sense that in principle you can actually sample from the situations that cause humans to think they are in a simulation or whatever.
For example, suppose the human had access to a button that said “This is weird, I’m probably in a simulation,” and suppose that we expected the human to press it in any case where they would start behaving weirdly. Then we could potentially sample from the subset of situations where the human presses the button. And if we manage to do that, then the human isn’t right to suspect they are in a simulation (any more than they already should believe they are in a simulation, prior to even building the AI).
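To make the button idea concrete, here is a minimal sketch of how one might sample disproportionately from the subset of situations where the human would press the button, so those cases can be deliberately mixed into the feedback process. Everything here (the function names, the `weirdness` score, the 1% flag rate, the batch sizes) is a hypothetical illustration for this discussion, not part of any actual proposal.

```python
import random

# Hypothetical stand-ins; a real system would generate rich situations and
# query an actual human (or a learned model of one).
def sample_situation():
    """Draw a situation from the base distribution the system will face."""
    return {"weirdness": random.random()}

def human_presses_button(situation):
    """Stand-in for the human's 'this is weird, I'm probably in a
    simulation' button: here, pressed on roughly the most extreme 1% of cases."""
    return situation["weirdness"] > 0.99

def sample_flagged_situation(max_tries=100_000):
    """Rejection-sample a situation from the subset where the human would
    press the button, so such cases can be shown to the human far more
    often than they occur naturally."""
    for _ in range(max_tries):
        s = sample_situation()
        if human_presses_button(s):
            return s
    return None  # the flagged subset may be too rare to hit this way

def build_feedback_batch(n_ordinary=90, n_flagged=10):
    """Mix ordinary situations with deliberately oversampled flagged ones,
    so the human's behavior gets specified on both."""
    batch = [sample_situation() for _ in range(n_ordinary)]
    flagged = (sample_flagged_situation() for _ in range(n_flagged))
    batch.extend(s for s in flagged if s is not None)
    return batch
```

Plain rejection sampling like this only works if the flagged subset isn’t astronomically rare under the base distribution; sampling it efficiently in general is part of what makes it “not obvious whether it is possible.”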
I agree that you shouldn’t expect the current version of the scheme to cope with this problem. And it’s certainly not obvious that the approach sketched above can work; there are many important questions. But I do think that there is a clear path to fixing this kind of problem.
(Interestingly, this is exactly analogous to what the active learning scheme has to do, in order to get the prediction algorithms to predict well.)
ETA: although I hadn’t thought about this explicitly in the context of ALBA, I have expected to need some way to overweight “weird” situations in order to stop them from being problematic, ever since here.
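As a footnote on the “overweight ‘weird’ situations” point: the generic ML move here would be to oversample the weird cases during training and importance-weight the loss, so that the rare cases get extra attention while the learned predictor still reflects the real-world distribution. A minimal sketch of that standard technique, with hypothetical names and no claim that this is how ALBA or the active learning scheme actually handles it:

```python
def importance_weights(batch, base_rate, train_rate):
    """Reweight a training batch in which 'weird' situations were
    oversampled: they appear with probability train_rate per example
    instead of their natural rate base_rate."""
    weights = []
    for _, is_weird in batch:
        if is_weird:
            weights.append(base_rate / train_rate)
        else:
            weights.append((1 - base_rate) / (1 - train_rate))
    return weights

def weighted_loss(per_example_losses, weights):
    """Importance-weighted average of per-example losses."""
    return sum(l * w for l, w in zip(per_example_losses, weights)) / sum(weights)

# Example: weird cases occur 0.1% of the time in deployment but are
# deliberately boosted to 10% of the training batch; each weird example
# then gets weight 0.001 / 0.1 = 0.01, and each ordinary example gets
# weight 0.999 / 0.9 ≈ 1.11.
```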