“Destroy the world” doesn’t seem to be a big problem to me. Paul’s (proposed) AGI can be viewed as not directly caring about our world, but only about a world/computation defined by H and T (let’s call that HT World). If it can figure out that its preferences for HT World are best satisfied by performing actions that (as a side effect) cause it to take over our world, then it seems likely that it can also figure out that it should take over our world in a non-destructive way. I’m more worried about whether (given realistic amounts of initial computing power) it would manage to do anything at all.
I’m not talking about Paul’s proposal in particular, but about eventually-Friendly AIs in general. Their defining feature is that they have a correct Friendly goal given by a complicated definition, one that leaves a lot of logical uncertainty about the goal until it’s eventually made more explicit. So we might explore the neighborhood of normal FAIs, increasing the initial logical uncertainty about their goal, so that they become more and more prone to initially pursuing generic instrumental gains at the expense of what they eventually realize to be their values.
Oh, please reinterpret my comment as replying to this comment of yours. (That one is specifically talking about Paul’s proposal, right?)
Well, yes, but I interpreted the problem of an impossibly complicated value definition (which does seem to be a problem with Paul’s specific proposal, even if we assume that it theoretically converges to a FAI) as the eFAI* never coming out of its destructive phase, and hence possibly just eating the universe without producing anything of value. So “destroy the world” is, in a sense, the sole manifestation of the problem with a hypothetical implementation of that proposal...
[* eFAI = eventually-Friendly AI, let’s coin this term]