the paperclipper, which from first principles decides that it must produce infinitely many paperclips
I don’t think this is an accurate description of the paperclip scenario, unless “first principles” means “hardcoded goals”.
Future GPT-3 will be protected from hyper-rational failures because of the noisy nature of its answers, so it can’t stick forever to some wrong policy.
Ignoring how GPT isn’t agentic and handwaving an agentic analogue, I don’t think this is sound. Wrong policies make up almost all of policyspace; the problem is not that the AI might enter a special state of wrongness, it’s that the AI might leave the special state of correctness. And to the extent that GPT is hindered by its randomness, it’s unable to carry out long-term plans at all—it’s safe only because it’s weak.