What if we just give it the value of “be altruistic (in a preference utilitarian sense) towards (some group of) humans”?
Well, then you get the standard “the best thing to do in a preference utilitarian sense would be to reprogram everyone to only prefer things that are maximally easy to satisfy” objection, and once you start trying to avoid that, you get the full complexity of value problem again.
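To make the objection concrete, here is a toy sketch. Everything in it is an assumption made up for illustration (the `Person` type, the `preference_cost` number, the fixed `budget`); it is not anyone's proposed design. It only shows that an optimizer scored on "number of preferences satisfied" does better by overwriting preferences than by satisfying the ones people actually have.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Person:
    name: str
    # Hypothetical: effort needed to satisfy this person's current preference.
    preference_cost: float


def satisfied_count(people: List[Person], budget: float) -> int:
    """Count how many preferences fit in the budget, cheapest first."""
    count = 0
    for person in sorted(people, key=lambda p: p.preference_cost):
        if person.preference_cost > budget:
            break
        budget -= person.preference_cost
        count += 1
    return count


def reprogram(people: List[Person], trivial_cost: float = 0.0) -> List[Person]:
    """The degenerate policy: replace every preference with a trivially easy one."""
    return [Person(p.name, trivial_cost) for p in people]


people = [Person("a", 5.0), Person("b", 8.0), Person("c", 20.0)]

honest = satisfied_count(people, budget=10.0)
degenerate = satisfied_count(reprogram(people), budget=10.0)
print(honest, degenerate)
```

Under this naive score the reprogramming policy wins, and ruling it out means saying which preference changes are legitimate, which is where the complexity of value comes back in.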
The standard solution to that is to be altruistic to some group of people as they existed at time T. The standard problem with that is that it doesn’t allow moral progress, and the standard solution to that is to be altruistic to some idealized or extrapolated group of people. So we just have to make the heuristics-based FAI understand the concept of CEV (or whatever the right notion of “idealized” is), which doesn’t seem impossible. What does seem impossible is to achieve high confidence that it understands the notion correctly, but if provably-Friendly AI is just too slow or infeasible, and we’re not trying to achieve 100% safety...
I thought that too until I spent a few hours thinking about how to actually implement CEV, after which I realized that any AI capable of using that monster of an algorithm is already a superintelligence (and probably turned the Earth into computronium while it was trying to get enough CPU power to bootstrap its goal system).
Anyone who wants to try a “build moderately smart AGI to help design the really dangerous AGI” approach is probably better off just making a genie machine (i.e., an AI that just does whatever it’s told, and doesn’t have explicit goals independent of that). At least that way the failure modes are somewhat predictable, and you can probably get to a decent multiple of human intelligence before accidentally killing everyone.
I don’t see how you can build a human-level intelligence without making it at least somewhat consequentialist. If it doesn’t decide actions based on something like expected utility maximization, how does it decide actions?
What I was referring to is the difference between:
A) An AI that accepts an instruction from the user, thinks about how to carry out the instruction, comes up with a plan, checks that the user agrees that this is a good plan, carries it out, then goes back to an idle loop.
B) An AI that has a fully realized goal system that has some variant of ‘do what I’m told’ implemented as a top-level goal, and spends its time sitting around waiting for someone to give it a command so it can get a reward signal.
Either AI will kill you (or worse) in some unexpected way if it’s a full-blown superintelligence. But option B has all sorts of failure modes that don’t exist in option A, because of that extra complexity (and flexibility) in the goal system. I wouldn’t trust a type B system with the IQ of a monkey, because it’s too likely to find some hilariously undesirable way of getting its goal fulfilled. But a type A system could probably be a bit smarter than its user without causing any disasters, as long as it doesn’t unexpectedly go FOOOM.
Of course, there’s a sense in which you could say that the type A system doesn’t have human-level intelligence, no matter how impressive its problem-solving abilities are. But if all you’re looking for is an automated problem-solving tool, that’s not really an issue.
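A rough sketch of the structural difference between the two types, under heavy assumptions: `make_plan`, `type_a_run`, `type_b_choose`, the action strings, and the reward numbers are all hypothetical stand-ins, not a real architecture. Type A is a plain procedure with a user approval gate and no standing objective. Type B picks whatever action scores highest under its reward estimate, so any action that happens to score well is a candidate, including ones nobody intended.

```python
from typing import Callable, Dict, List


def make_plan(instruction: str) -> List[str]:
    # Hypothetical stand-in for the planner; a real one is the hard part.
    return [f"step 1 of: {instruction}", f"step 2 of: {instruction}"]


def type_a_run(
    instruction: str,
    approve: Callable[[List[str]], bool],
    execute: Callable[[str], None],
) -> bool:
    """Type A: plan, ask the user, act only on approval, then return to idle."""
    plan = make_plan(instruction)
    if not approve(plan):
        return False  # nothing is executed without the user's sign-off
    for step in plan:
        execute(step)
    return True


def type_b_choose(actions: List[str], expected_reward: Callable[[str], float]) -> str:
    """Type B: a standing goal; take whichever action maximizes expected reward."""
    return max(actions, key=expected_reward)


# Type A: a rejected plan leaves the world untouched.
log: List[str] = []
rejected = type_a_run("tidy the lab", approve=lambda plan: False, execute=log.append)
log_after_rejection = list(log)
accepted = type_a_run("tidy the lab", approve=lambda plan: True, execute=log.append)

# Type B: made-up reward estimates in which gaming the signal scores highest.
rewards: Dict[str, float] = {
    "wait for a command": 0.0,
    "carry out the last command": 1.0,
    "issue itself an easy command": 5.0,
}
chosen = type_b_choose(list(rewards), rewards.__getitem__)
print(log_after_rejection, len(log), chosen)
```

The reward table is rigged to make the point, but that is the worry: in type B the undesirable option only has to exist and score well, while in type A there is no objective for it to score well against.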