Just before reading this, I got a shower thought that most of the previously described AI-related catastrophes were of the “hyper-rational” type, e.g. the paperclipper, which from first principles decides that it must produce infinitely many paperclips.
However, this is not how ML-based systems fail. They either fail randomly, when they encounter something like an adversarial example, or fail slowly, by Goodharting some performance measure. Such systems could also be used to create dangerous weapons, e.g. fake news or viruses, or interact unpredictably with each other.
A future GPT-3 will be protected from hyper-rational failures because of the noisy nature of its answers, so it can’t stick forever to some wrong policy.
I think that’s a straw man of the classic AI-related catastrophe scenarios. Bostrom’s “covert preparation” --> “treacherous turn” --> “takeover” story maps pretty nicely to Paul’s “seek influence via gaming tests” --> “they are now more interested in controlling influence after the resulting catastrophe than continuing to play nice with existing institutions and incentives” --> “One day leaders may find that despite their nominal authority they don’t actually have control over what these institutions do. For example, military leaders might issue an order and find it is ignored. This might immediately prompt panic and a strong response, but the response itself may run into the same problem, and at that point the game may be up.”
the paperclipper, which from first principles decides that it must produce infinitely many paperclips
I don’t think this is an accurate description of the paperclip scenario, unless “first principles” means “hardcoded goals”.
A future GPT-3 will be protected from hyper-rational failures because of the noisy nature of its answers, so it can’t stick forever to some wrong policy.
Ignoring how GPT isn’t agentic and handwaving an agentic analogue, I don’t think this is sound. Wrong policies make up almost all of policyspace; the problem is not that the AI might enter a special state of wrongness, it’s that the AI might leave the special state of correctness. And to the extent that GPT is hindered by its randomness, it’s unable to carry out long-term plans at all—it’s safe only because it’s weak.
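To make the “noise only buys safety by destroying competence” point concrete, here is a toy sketch of my own (not from either comment): assume a hypothetical plan that requires the right action at each of T steps, and per-step sampling noise that derails the plan with probability eps. Then the chance of completing any specific long-horizon plan, aligned or misaligned, decays exponentially in T; the noise doesn’t select against wrong plans, it degrades all plans equally.

```python
# Toy simulation under the stated assumptions: a plan succeeds only if every
# one of `horizon` steps avoids a noisy deviation (per-step probability eps).
import random

def completes_plan(horizon: int, eps: float) -> bool:
    """One rollout: the plan fails at the first noisy deviation."""
    return all(random.random() > eps for _ in range(horizon))

def success_rate(horizon: int, eps: float, trials: int = 10_000) -> float:
    """Monte Carlo estimate of the probability of finishing the whole plan."""
    return sum(completes_plan(horizon, eps) for _ in range(trials)) / trials

if __name__ == "__main__":
    for horizon in (10, 100, 1000):
        print(horizon, round(success_rate(horizon, eps=0.01), 3))
    # Roughly (1 - 0.01) ** horizon: ~0.90, ~0.37, ~0.00 --
    # randomness prevents sticking to *any* long-term policy, wrong or right.
```

The same exponential decay applies whether the plan is a treacherous turn or a beneficial long-term project, which is the sense in which such a system is safe only because it’s weak.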