Thanks! I think this is a case where good design principles for AGI diverge from good design principles for, say, self-driving cars.
“Minimizing expected harm done” is a very dangerous design principle for AGIs because of Goodhart’s law / specification gaming. For example, if you define “harm” as “humans dying or becoming injured”, then the AGI will be motivated to imprison all the humans in underground bunkers with padded walls! Worse, how do you write machine code for “expected harm”? It’s not so straightforward…and if there’s any edge case where your code messes up the calculation—i.e., where the expected-harm-calculation-module outputs an answer that diverges from actual expected harm—then the AGI may find that edge-case, and do something even worse than imprison all the humans in underground bunkers with padded walls!
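To make the specification-gaming worry concrete, here's a toy sketch (entirely hypothetical; the plan names, the scoring function, and the numbers are all made up for illustration) of how a planner that purely minimizes a hand-coded "expected harm" proxy can end up selecting a degenerate plan:

```python
# Toy illustration of Goodhart's law / specification gaming (hypothetical,
# not a real AGI objective): a planner that minimizes a hand-coded
# "expected harm" proxy can pick a degenerate action.

# Each candidate plan has a made-up proxy score: expected injuries per year.
candidate_plans = {
    "maintain status quo":                 {"expected_injuries": 1000.0},
    "improve road / rail safety":          {"expected_injuries": 400.0},
    "confine everyone to padded bunkers":  {"expected_injuries": 0.0},  # "no injuries"!
}

def proxy_harm(plan_name: str) -> float:
    """Hand-coded proxy for 'harm' = expected injuries. It silently ignores
    everything else humans care about (freedom, flourishing, ...)."""
    return candidate_plans[plan_name]["expected_injuries"]

# A pure proxy-minimizer happily selects the degenerate plan.
best_plan = min(candidate_plans, key=proxy_harm)
print(best_plan)  # -> "confine everyone to padded bunkers"
```

The point is not that anyone would write this exact code, but that whatever module actually computes "expected harm" will have gaps between what it measures and what we meant, and a strong optimizer pushes straight into those gaps.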
I agree that “being paralyzed by indecision” can be dangerous if the AGI is controlling a moving train. But first of all, this is “everyday dangerous” as opposed to “global catastrophic risk dangerous”, which is what I’m worried about. And second of all, it can be mitigated by, well, not putting such an AGI in control of a moving train!! You don’t need a train-controlling AI to be an AGI with cross-domain knowledge and superhuman creative problem-solving abilities. We can just use normal 2021-type narrow AIs for our trains and cars. I mean, there were already self-driving trains in the 1960s. They’re not perfect, but neither are humans, it’s fine.
Someday we may really want to put an AGI in charge of an extremely complicated fast-moving system like an electric grid—where you’d like cross-domain knowledge and superhuman problem-solving abilities (e.g. to diagnose and solve a problem that’s never occurred before), and where stopping a cascading blackout can require making decisions in a split-second, too fast to put a human in the loop. In that case, “being paralyzed by indecision” is actually a pretty bad failure mode (albeit not necessarily worse than the status quo, since humans can also be paralyzed by indecision).
I would say: We don’t have to solve that problem right now. We can leave humans in charge of the electric grid a bit longer! Instead, let’s build AGIs that can work with humans to dramatically improve our foresight and reasoning abilities. With those AGI assistants by our sides, we can tackle the question of what’s the best way to control the electric grid! Maybe we’ll come up with a redesigned next-generation AGI that can be safe without being conservative. Or maybe we’ll design a special-purpose electric-grid-controlling narrow AI. Or maybe we’ll even stick with humans! I don’t know, and we don’t have to figure it out now.
In other words, I think that the goal right now is to solve the problem of safe powerful human-in-the-loop AGI, and then we can use those systems to help us think about what to do in cases where we can’t have a human in the loop.
I agree that there’s a sense in which “leave the lever alone” is a decision. However, I’m optimistic that we can program an AGI to treat NOOP as having a special status, so it’s the default output when all options are unpalatable. To be crystal-clear: I’m not claiming that this happens naturally, I’m proposing that this is a thing that we deliberately write into the AGI source code. And the reason I want NOOP to have a special status is that an AGI will definitely not cause an irreversible global catastrophe by NOOP’ing. :-)
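Roughly, the kind of thing I have in mind is sketched below (purely illustrative; the scoring module, the threshold, and the plan names are assumptions for the example, not a claim about how a real AGI would actually be built):

```python
# Minimal sketch (hypothetical) of "NOOP has special status": the planner
# only acts if some plan clears an acceptability bar; otherwise it defaults
# to doing nothing.

NOOP = "do nothing"

def choose_action(scored_plans: dict[str, float], acceptability_threshold: float) -> str:
    """Return the best plan only if it is clearly acceptable; otherwise NOOP.

    scored_plans maps plan names to scores produced by some (assumed)
    evaluation module; acceptability_threshold is set conservatively so
    that 'all options are unpalatable' falls through to NOOP.
    """
    if not scored_plans:
        return NOOP
    best_plan, best_score = max(scored_plans.items(), key=lambda kv: kv[1])
    return best_plan if best_score >= acceptability_threshold else NOOP

# Example: every option scores poorly, so the privileged default NOOP wins.
print(choose_action({"pull lever": 0.2, "push person": 0.1}, acceptability_threshold=0.9))
# -> "do nothing"
```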
I think you are missing something critical.
What do we need AGI for that mere 2021 narrow agents can’t do?
The top thing we need is a system that can keep us biologically and mentally alive for as long as possible.
Such an AGI is constrained by time and will constantly be in situations where all choices cause some harm to a person.