(Crossposting some of my Twitter comments.)
I liked this criticism of alignment approaches: it makes a concrete claim that addresses the crux of the matter, and provides supporting evidence! I also disagree with it, and will say some things about why.
I think that instead of thinking in terms of “coherence” vs. “hot mess”, it is more fruitful to ask “how much influence is this system exerting on its environment?” Too much influence will kill humans if it is directed at an outcome we’re not able to choose. (The rest of my comments are all variations on this basic theme.)
We humans may be a hot mess, but we’re far better at influencing (optimizing) our environment than any other animal or ML system. Example: we build helicopters and roads, which are very unlikely to arise by accident in a world without people trying to build helicopters or roads. If a system is good enough at achieving outcomes, it is dangerous whether or not it is a “hot mess”.
It’s much easier for us to describe simple behaviors as utility maximization; for example, a ball rolling down a hill is well described as minimizing its potential energy. So it’s natural that people will rate a dumb / simple system as being more easily described by a utility function than a smart system with complex behaviors. This does not make the smart system any less dangerous.
Misalignment risk is not about expecting a system to “inflexibly” or “monomaniacally” pursue a simple objective. It’s about expecting systems to pursue objectives at all. The objectives don’t need to be simple or easy to understand.
Intelligence isn’t the right measure to have on the X-axis—it evokes a math professor in an ivory tower, removed from the goings-on in the real world. A better word might be capability: “how good is this entity at going out into the world and getting more of what it wants?”
In practice, AI labs are working on improving capability, rather than intelligence defined abstractly in a way that does not connect to capability. And capability is about achieving objectives.
If we build something more capable than humans in a certain domain, we should expect it to be “coherent” in the sense that it will not make any mistakes that a smart human wouldn’t have made. Caveat: it might make more of a particular kind of mistake, and make up for it by being better at other things. This happens with current systems, and IMO plausibly we’ll see something similar even in the kind of system I’d call AGI. But at some point the capabilities of AI systems will be general enough that they will stop making mistakes that are exploitable by humans. This includes mistakes like “fail to notice that your programmer could shut you down, and that would stop you from achieving any of your objectives”.