I’m not talking about Paul’s proposal in particular, but about eventually-Friendly AIs in general. Their defining feature is that they have a correct Friendly goal given by a complicated definition that leaves a lot of logical uncertainty about the goal until it’s eventually made more explicit. So we might explore the neighborhood of normal FAIs, increasing the initial logical uncertainty about their goal, so that they become more and more prone to initial pursuit of generic instrumental gains at the expense of what they eventually realize to be their values.
Oh, please reinterpret my comment as replying to this comment of yours. (That one is specifically talking about Paul’s proposal, right?)
Well, yes, but I interpreted the problem of an impossibly complicated value definition (which does seem to be a problem with Paul’s specific proposal, even if we assume that it theoretically converges to a FAI) as the eFAI* never coming out of its destructive phase, and hence possibly just eating the universe without producing anything of value. So “destroy the world” is in a sense the sole manifestation of the problem with a hypothetical implementation of that proposal...
[* eFAI = eventually-Friendly AI, let’s coin this term]