If there is a problem you can’t solve, then there is an easier problem you can solve: find it.
—George Pólya
I know the answer to “couldn’t you just-” is always “no”, but couldn’t you just make an AI that doesn’t try very hard? I.e., one that seeks the smallest possible intervention that ensures a 95% chance of achieving whatever goal it’s intended for.
This isn’t a utility maximizer, because it cares about intermediate states. Some of the coherence theorems wouldn’t apply.
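To make the proposal concrete (this is my sketch, not anything from the exchange above), the selection rule can be written as a constrained search: keep only the candidate plans whose estimated success probability clears the threshold, then pick the one with the smallest impact, rather than the one with the highest success probability or expected utility. `Plan`, `impact`, `success_prob`, and `mild_select` are made-up names, and the genuinely hard parts, measuring impact and estimating success probability, are assumed away.

```python
from dataclasses import dataclass

# Hypothetical toy model of the proposal above: pick the *smallest*
# intervention that clears a fixed success threshold, instead of the
# intervention that maximizes success probability or expected utility.
# Plan, impact, and success_prob are illustrative names, not a real API.

@dataclass
class Plan:
    name: str
    impact: float        # how much the plan perturbs the world (assumed measurable)
    success_prob: float  # estimated probability the intended goal is achieved

def mild_select(plans: list[Plan], threshold: float = 0.95) -> Plan | None:
    """Return the lowest-impact plan whose success probability meets the threshold."""
    feasible = [p for p in plans if p.success_prob >= threshold]
    if not feasible:
        return None  # decline to act rather than escalate impact
    return min(feasible, key=lambda p: p.impact)

candidates = [
    Plan("do almost nothing",  impact=0.1,  success_prob=0.60),
    Plan("small nudge",        impact=1.0,  success_prob=0.96),
    Plan("heavy intervention", impact=50.0, success_prob=0.999),
]
print(mild_select(candidates).name)  # "small nudge", not the probability maximizer's pick
```

Because the choice depends on how much the plan perturbs the world along the way, not only on which outcome obtains, the agent is not ranking actions by expected utility over outcomes, which is the sense in which some of the coherence theorems wouldn’t apply.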
The only bound on incoherence is the ability to survive. So most of alignment is not about EU maximizers; it’s about things that might eventually build something like EU maximizers, and about how those things would be formulating their/our values. If there is reliable global alignment security, preventing rival misaligned agents from getting built anywhere in the world where they would have a fighting chance, then the only thing calling for a transition to better agent foundations is making more efficient use of the cosmos, bringing out more of the potential of the current civilization’s values.
(See also: hard problem of corrigibility, mild optimization, cosmic endowment, CEV.)
“Hard problem of corrigibility” refers to the Arbital page “Problem of fully updated deference”, which uses a simplification (that human preferences can be described as a utility function) that can be inappropriate for the problem. Human preferences are obviously path-dependent (you don’t want to be painfully disassembled and then reconstituted as a perfectly happy person with no memory of the disassembly). Was the appropriateness of this simplification discussed somewhere?
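As a rough formalization of the path-dependence point (my framing, not something from the thread): if preferences were captured by a utility function over end states alone, any two histories with the same end state would have to be ranked the same.

```latex
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
% Illustrative only: H is the set of histories, S the set of states,
% end(h) the terminal state of history h.
Suppose preferences are represented by $u : S \to \mathbb{R}$ over end states, i.e.
\[
  h_1 \succeq h_2 \iff u(\mathrm{end}(h_1)) \ge u(\mathrm{end}(h_2)).
\]
Then
\[
  \mathrm{end}(h_1) = \mathrm{end}(h_2) \;\Longrightarrow\; h_1 \sim h_2 .
\]
\end{document}
```

The disassembly example gives two histories with the same end state (a happy person with no memory of the process) that are clearly not ranked the same, so a faithful representation would need to be defined over histories rather than end states, which is one way of saying the simplification can be inappropriate.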
It’s mentioned there as an example of a thing that doesn’t seem to work. Simplifications are often appropriate as a way of making a problem tractable, even if the analogy is lost and the results are inapplicable to the original problem. Such exercises occasionally produce useful insights in unexpected ways.
Human preference, as practiced by humans, is not the sort of thing that’s appropriate to turn into a utility function in any direct way. Hence things like CEV, which gesture at the sort of processes that might have any chance of doing something relevant to turning humans into goals for strong agents. Any real attempt should involve a lot of thinking from many different frames, probably an archipelago of stable civilizations running for a long time, and foundational theory on what kinds of things idealized preference is about; and it might still fail to go anywhere at a human level of intelligence. The thing that can actually be practiced right now is the foundational theory: the nature of agency and norms, decision-making and coordination.