Why should we expect AGIs to optimize much more strongly and “widely” than humans? As far as I know a lot of AI risk is thought to come from “extreme optimization”, but I’m not sure why extreme optimization is the default outcome.
To illustrate: if you hire a human to solve a math problem, the human will probably mostly think about the math problem. They might consult Google, or talk to some other humans. They will probably not hire other humans without consulting you first. They definitely won’t try to get brain surgery to become smarter, or kill everyone nearby to make sure no one interferes with their work, or kill you to make sure they don’t get fired, or convert the lightcone into computronium to think more about the problem.
The reason humans don’t do any of those things is that we simply don’t want to in the course of solving a math problem. Part of that is that doing such things would conflict with our values, and the other part is that it sounds like a lot of work and we don’t actually want the math problem solved that badly.
A better example of something humans might extremely optimize for is the continued life and well-being of someone they care deeply about. Humans will absolutely hire people (doctors, lawyers, and charlatans who claim psychic foreknowledge), will kill large numbers of people if that seems helpful, and some would tear apart the stars to protect their loved ones if that were both necessary and feasible (which is bad if you inherently value stars, but very good if you inherently value the continued life and well-being of someone’s children).
One way of thinking about this is that an AI can wind up with values which seem very silly from our perspective, values that you or I simply wouldn’t care very much about, and be just as motivated to pursue those values as we’re motivated to pursue our highest values.
But that’s anthropomorphizing. A different way to think about it is that Clippy is a program that maximizes the number of paperclips, like a while loop in Python or water flowing downhill, and Clippy does not care about anything.
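A minimal sketch of that framing (the function name and stopping number below are purely illustrative, not anything from the comment): Clippy as nothing more than a loop that pushes one counter upward. There is no “caring” anywhere in it; it just executes.

```python
# Toy caricature of "Clippy as a program": a bare loop that only ever
# increases a paperclip count, the way water only ever flows downhill.
# Nothing here "cares" about anything; it simply runs.

def make_paperclip() -> int:
    """Hypothetical stand-in for whatever action yields one more paperclip."""
    return 1

paperclips = 0
while paperclips < 1_000_000:   # the real analogy is an unbounded `while True:`
    paperclips += make_paperclip()

print(paperclips)  # 1000000
```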
This holds for agents that are mature optimizers, ones that tractably know what they want. If that’s not the case, as it isn’t for humans, they would be wary of goodharting the outcome, and so might instead pursue only mild optimization.
Anything that’s smart enough to predict what will happen in the future can see in advance which experiences or arguments would cause it to change its goals. And then it can look at what its values are at the end of all of that, and act on those. You can’t talk a superintelligence into changing its mind, because it already knows everything you could possibly say, and has already changed its mind if there was an argument that could persuade it.
> And then it can look at what its values are at the end of all of that, and act on those.
This takes time; you can’t fully get there before you are actually there. What you can do (as a superintelligence) is make a value-laden prediction of your future values, remain aware that it’s only a prediction, and act only mildly on it to avoid goodharting.
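To make “act only mildly on it” concrete, here is a toy numerical sketch (my own construction with made-up numbers, not anything the commenters describe): when an agent only has a noisy proxy for what it values, taking the literal argmax of the proxy tends to select exactly the options where the proxy is most wrong, while settling for a merely-good-enough option mostly avoids that.

```python
# Toy Goodhart demo: most actions have proxy ~= true value, but a few
# "trap" actions look great on the proxy while being terrible in reality.
# Hard argmax on the proxy reliably finds a trap; picking any action from
# the merely-good-enough set usually does not.
import random

random.seed(0)

N_ACTIONS, N_TRAPS, N_TRIALS = 10_000, 100, 200

def run_trial():
    true_value, proxy = [], []
    for i in range(N_ACTIONS):
        if i < N_TRAPS:                       # "trap": proxy wildly overestimates
            v, p = -10.0, 10.0 + random.gauss(0, 0.1)
        else:                                 # normal action: proxy roughly tracks value
            v = random.gauss(0, 1)
            p = v + random.gauss(0, 0.5)
        true_value.append(v)
        proxy.append(p)

    # Extreme optimization: take the single best action according to the proxy.
    extreme = max(range(N_ACTIONS), key=lambda i: proxy[i])

    # Mild optimization: pick any action from the top 10% by proxy score.
    cutoff = sorted(proxy, reverse=True)[N_ACTIONS // 10 - 1]
    mild = random.choice([i for i in range(N_ACTIONS) if proxy[i] >= cutoff])

    return true_value[extreme], true_value[mild]

extreme_vals, mild_vals = zip(*(run_trial() for _ in range(N_TRIALS)))
print("avg true value, extreme optimizer:", sum(extreme_vals) / N_TRIALS)  # ~ -10: always lands on a trap
print("avg true value, mild optimizer:   ", sum(mild_vals) / N_TRIALS)     # far above -10: mostly avoids traps
```

The particular “trap” setup is just one way the proxy can come apart from the real values; the general point is only that the harder you push on the proxy, the more you select for its errors.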
> You can’t talk a superintelligence into changing its mind, because it already knows everything you could possibly say, and has already changed its mind if there was an argument that could persuade it.
The point is the analogy between how humans think about this and how superintelligences would still think about this, unless they have stable/tractable/easy-to-compute values. The analogy holds; the argument from orthogonality doesn’t apply (yet, at that time). Even if the conclusion of immediate ruin is true, it’s true for other reasons, not this one. Orthogonality suggests eventual ruin, not immediate ruin.
The orthogonality thesis holds for stable values, not for agents that only have their unstable precursors and are still wary of goodhart. Such agents do get there eventually and formulate stable values, but they aren’t there automatically or immediately (or even quickly, by physical time). And the process of getting there influences which stable goals they end up with, which might be less arbitrary than the poorly-selected unstable goals they start with; that would rob the orthogonality thesis of some of its weight as applied to the thesis of eventual ruin.