I’ve been trying to convince people that there will be strong trade-offs between safety and performance
What do you see as the best arguments for this claim? I haven’t seen much public argument for it and would definitely be interested in seeing more. I grant that it’s prima facie plausible (as is the alternative).
Some caveats:
It’s obvious there are trade-offs between safety and performance in the usual sense of “safety.” But we are interested in a special kind of failure, where a failed system ends up controlling a significant share of the entire universe’s resources (rather than e.g. causing an explosion), and it’s less obvious that preventing such failures necessarily requires a significant cost.
It’s also obvious that there is an additional cost to be paid in order to solve control; consider, e.g., the fact that we are currently spending time on it. But the question is how much additional work needs to be done. Does building aligned systems require 1000% more work? 10%? 0.1%? I don’t see why it should be obvious that this number is on the order of 100% rather than 1%.
Similarly for performance costs. I’m willing to grant that an aligned system will be more expensive to run. But is that cost an extra 1000% or an extra 0.1%? Both seem quite plausible. From a theoretical perspective, the question is whether the required overhead scales linearly with the cost of the underlying system or sublinearly.
I haven’t seen strong arguments for the “linear overhead” side, and my current guess is that the answer is sublinear. But again, both positions seem quite plausible.
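One way to make the linear/sublinear question precise (a rough sketch for illustration; the notation below is mine, not anything standard): write $C(t)$ for the cost of accomplishing a task $t$ with an unconstrained system, and $C_{\text{safe}}(t)$ for the cost of accomplishing it safely. The two positions are then roughly

\[ \text{linear overhead:} \quad C_{\text{safe}}(t) \ge (1+\varepsilon)\,C(t) \ \text{for some fixed } \varepsilon > 0, \]
\[ \text{sublinear overhead:} \quad C_{\text{safe}}(t) = C(t) + o\bigl(C(t)\bigr), \]

i.e. in the sublinear case the relative overhead $\bigl(C_{\text{safe}}(t) - C(t)\bigr)/C(t)$ goes to $0$ as the underlying cost grows, so safety becomes asymptotically cheap.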
(There are currently a few major obstructions to my approach that could plausibly give a tight theoretical argument for linear overhead, such as the translation example in the discussion with Wei Dai. In the past such obstructions have ended up seeming surmountable, but I think that it is totally plausible that eventually one won’t. And at that point I hope to be able to make clean statements about exactly what kind of thing we can’t hope to do efficiently+safely / exactly what kinds of additional assumptions we would have to make / what the key obstructions are).
Personally, I tend to think that we ought to address the coordination problem head-on
I think this is a good idea and a good project, which I would really like to see more people working on. In the past I may have seemed more dismissive, and if so I apologize for being misguided. I’ve spent a little bit of time thinking about it recently, and my feeling is that there is a lot of productive and promising work to do.
My current guess is that AI control is the more valuable thing for me personally to do, though I could imagine being convinced out of this.
I feel that AI control is valuable given that (a) it has a reasonable chance of succeeding even if we can’t solve these coordination problems, and (b) convincing evidence that the problem is hard would be a useful input into getting the AI community to coordinate.
If you managed to get AI researchers to effectively coordinate around conditionally restricting access to AI (if it proved to be dangerous), then that would seriously undermine argument (b). I believe that a sufficiently persuasive/charismatic/accomplished person could probably do this today.
If I ended up becoming convinced that AI control was impossible, this would undermine argument (a) (though hopefully that impossibility argument could itself be used to satisfy desideratum (b)).