You definitely don’t understand what I’m getting at here, but I’m not yet sure exactly where the inductive gap is. I’ll emphasize a few particular things; let me know if any of this helps.
There’s this story about an airplane (I think the B-17 originally?) where the levers for the flaps and landing gear were identical and right next to each other. Pilots kept coming in to land, and accidentally retracting the landing gear. The point of the story is that this is a design problem with the plane more than a mistake on the pilots’ part; the problem was fixed by putting a little rubber wheel on the landing gear lever. If we put two identical levers right next to each other, it’s basically inevitable that mistakes will be made; that’s bad interface design.
AI has a similar problem, but far more severe, because the systems to which we are interfacing are far more conceptually complicated. If we have confusing interfaces on AI, which allow people to shoot the world in the foot, then the world will inevitably be shot in the foot, just like putting two identical levers next to each other guarantees that the wrong one will sometimes be pulled.
For tool AI in particular, the key piece is this:
the big value-proposition of powerful AI is its ability to reason about systems or problems too complicated for humans—which are exactly the systems/problems where safety issues are likely to be nonobvious. If we’re going to unlock the full value of AI at all, we’ll need to use it on problems where humans do not know the relevant safety issues.
The claim here is that either (a) the AI in question doesn’t achieve the main value prop of AI (i.e. reasoning about systems too complicated for humans), or (b) the system itself has to do the work of making sure it’s safe. If neither of those conditions is met, then mistakes will absolutely be made regularly. The human operator cannot be trusted to make sure what they’re asking for is safe, because they will definitely make mistakes.
On the other hand, if the AI itself is able to evaluate whether its outputs are safe, then we can potentially achieve very high levels of safety. It could plausibly never go wrong over the lifetime of the universe. Just like, if you design a tablesaw with an automatic shut-off, it could plausibly never cut off anybody’s finger. But if you design a tablesaw without an automatic shut-off, it is near-certain to cut off a finger from time to time. That level of safety can be achieved, in general, but it cannot be achieved while relying on the human operator not making mistakes.
Coming at it from a different angle: if a safety problem is handled by a system’s designer, then their die-roll happens once up-front. If that die-roll comes out favorably, then the system is safe (at least with respect to the problem under consideration); it avoids the problem by design. On the other hand, if a safety problem is left to the system’s users, then a die-roll happens every time the system is used, so inevitably some of those die rolls will come out unfavorably. Thus the importance of designing AI for safety up-front, rather than relying on users to use it safely.
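To make the difference concrete, here’s a minimal sketch of the arithmetic (my own illustration; the failure probabilities and usage count are made-up numbers, not anything from this discussion):

```python
# Toy comparison of "the designer rolls the die once" vs "every use rolls the die".
# All numbers below are invented for illustration.

p_design = 0.01   # chance the one-time, up-front design die-roll comes out badly
p_per_use = 0.01  # chance any single use goes badly when safety is left to the operator
n_uses = 100_000  # total uses of the system over its deployment

# Design-stage risk is paid exactly once, regardless of how often the system is used.
p_designer_failure = p_design

# Per-use risk compounds: the chance of at least one bad roll across all uses.
p_user_failure = 1 - (1 - p_per_use) ** n_uses

print(f"designer-side risk: {p_designer_failure:.4f}")
print(f"operator-side risk: {p_user_failure:.4f}")  # ~1.0, essentially guaranteed
```

Even a fairly small per-use failure probability compounds into near-certain failure once the system is used enough times, whereas the design-stage risk is incurred exactly once.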
Is it more clear what I’m getting at now and/or does this prompt further questions?
Yeah, this makes much more sense.

The claim here is that either (a) the AI in question doesn’t achieve the main value prop of AI (i.e. reasoning about systems too complicated for humans), or (b) the system itself has to do the work of making sure it’s safe.
I see the intuitive appeal of this claim, but it seems too strong. I suspect that if we look at rates of accidents, we’ll find they’ve been going down over time, at least for the last few centuries. It seems like this can continue going down, toward an asymptote of zero, in the same way it has so far: we become better at understanding how accidents happen and more careful in how we use dangerous technologies. We already use tools for this (in software, we use debuggers, profilers, type systems, etc.) or delegate to other humans (as in a large company). We can continue to do so with AI systems.
I buy that eventually “most of the work” has to be done by the AI system, but it seems plausible that this won’t happen until well after advanced AI, and that advanced AI will help us get there. And so, from a what-should-we-do perspective, it’s fine to rely on humans for some aspects of safety in the short term (though of course it would be preferable to delegate entirely to a system we knew was safe and beneficial).
(Why bother relying on humans? If you want to build a goal-directed AI system, it sure seems better if it’s under the control of some human, rather than not. It’s not clear what a plausible option is if you can’t have the AI system under the control of some human.)
In the die-roll analogy, the hope is that the rate at which you roll dice decays approximately exponentially, so that the total number of dice you ever roll is asymptotically constant.
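To spell out the arithmetic behind that hope (my own sketch, not something from the comment above): if the rate of new die rolls decays exponentially, the total number of rolls ever made converges, so the expected number of bad outcomes stays bounded:

$$\text{total rolls} = \sum_{t=0}^{\infty} r_0 q^t = \frac{r_0}{1-q}, \qquad \mathbb{E}[\text{bad outcomes}] \le p \cdot \frac{r_0}{1-q}$$

Here $r_0$ is the initial rate of rolls, $q<1$ is the per-period decay factor, and $p$ is the chance that any single roll comes out badly; the point is that the bound is finite no matter how long the system stays in use.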
I somehow agree with both you and OP, and also I don’t buy part of the lever analogy yet. It seems important that the levers not only look similar, but that they be close to each other, in order to expect users to reliably mess up. Similarly, strong tool AI will offer many, many affordances, and it isn’t clear how “close” I should expect them to be in use-space. From the security mindset, that’s sufficient cause for serious concern, but I’m still trying to shake out the expected value estimate for powerful tool AIs—will they be thermonuclear-weapon-like (as in your post), or will mistakes generally look different?
One way in which the analogy breaks down: in the lever case, we have two levers right next to each other, and each does something we want—it’s just easy to confuse the levers. A better analogy for AI might be: many levers and switches and dials have to be set to get the behavior we want, and mistakes in some of them matter while mistakes in others don’t, and we don’t know which ones matter when. And sometimes people will figure out that a particular combination extends the flaps, so they’ll say “do this to extend the flaps”, except that, when some other switch has the wrong setting and it’s between 4 and 5 pm on Thursday, that combination will still extend the flaps, but it will also retract the landing gear, and nobody noticed that before they wrote down the instructions for how to extend the flaps.
Some features which this analogy better highlights (a toy sketch follows the list):
Most of the interface-space does things we either don’t care about or actively do not want
Even among things which usually look like they do what we want, most do something we don’t want at least some of the time
The system has a lot of dimensions; we can’t brute-force check all combinations, and problems may be in seemingly-unrelated dimensions
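As a toy illustration of those three features (my own sketch in code, not something from the comments above; the switch indices, time window, and effects are all invented): a control panel with a few dozen binary switches, where the documented “extend the flaps” recipe works whenever anyone spot-checks it, but one seemingly-unrelated switch plus a particular time window also retracts the landing gear, and the full configuration space is far too large to check exhaustively.

```python
N_SWITCHES = 40  # a modest panel; 2**40 is already over a trillion combinations

def panel(settings, hour=12, weekday="Wednesday"):
    """Toy control panel: return the set of effects for a given configuration.

    The documented "extend the flaps" recipe only involves switches 0-2, but a
    seemingly unrelated switch (17) combined with a particular time window also
    retracts the landing gear: the hidden interaction from the analogy.
    """
    effects = set()
    if settings[0] and settings[1] and not settings[2]:
        effects.add("flaps extended")
        if settings[17] and weekday == "Thursday" and 16 <= hour < 17:
            effects.add("landing gear retracted")  # nobody ever checked this case
    return effects

# The written-down recipe looks fine whenever anyone spot-checks it...
recipe = [False] * N_SWITCHES
recipe[0], recipe[1] = True, True
print(panel(recipe))  # {'flaps extended'}

# ...but the configuration space is too big to brute-force, and the bad
# combination is still out there, waiting on a different switch and a clock.
print(f"configurations to check exhaustively: {2 ** N_SWITCHES:,}")
bad = list(recipe)
bad[17] = True
print(panel(bad, hour=16, weekday="Thursday"))  # now also includes 'landing gear retracted'
```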