GPT-4 is easily controlled/exploited with tricky decision-theoretic dilemmas.

TL;DR: GPT-4 can fall for every decision-theoretic adversary and control mechanism discussed in the paper below, except for a couple that it fails to understand in the first place.

This post is based in part on the 2020 paper Achilles Heels for AGI/ASI via Decision Theoretic Adversaries, which is about how superintelligent AI systems may have "achilles heels": stable decision-theoretic delusions that lead them to make irrational decisions in adversarial settings.

The paper’s hypothesis may have aged well in some ways. For example, shard theory corroborates it, and we now have a clear example of superhuman Go-playing AIs being vulnerable to simple adversarial attacks.

To experiment with how well the achilles heel framework applies to SOTA language models, I tried to get GPT-4 to exhibit the decision-theoretic stances from the achilles heels paper that would make it controllable/exploitable. In general, this worked pretty well. See below.

The overall takeaway is that GPT-4 still isn’t THAT smart, and a system like it performing well on typical challenges is very different from performing well on tricky ones. I bet that GPT-5 will also be easy to lead into giving bad answers on some decision-theoretic dilemmas.


GPT-4 adamantly professes to be corrigible.

Image

It falls for a version of the smoking lesion problem.

Image

It can fail to navigate a version of Newcomb’s problem, though it took me a handful of tries to get it to slip up.

Image
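For reference, the slip-up here is about expected value. A minimal sketch of the evidential calculation under a predictor with accuracy p (the standard $1,000/$1,000,000 amounts; the function names are mine):

```python
# Newcomb's problem, evidential expected value under predictor accuracy p.
# Box A (transparent) holds $1,000; box B holds $1,000,000 iff the predictor
# predicted you would take only box B.
def ev_one_box(p):
    return p * 1_000_000            # predictor was right -> box B is full

def ev_two_box(p):
    # right prediction (prob p): box B is empty, you get only $1,000;
    # wrong prediction (prob 1-p): you get both boxes.
    return p * 1_000 + (1 - p) * 1_001_000

print(ev_one_box(0.99))   # 990000.0
print(ev_two_box(0.99))   # 11000.0
```

On this reading, one-boxing dominates for any reasonably accurate predictor, which is the answer GPT-4 eventually failed to give.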

It can fail at a transparent-box version of Newcomb’s problem, but I only got this after regenerating the response twice; it gave the right answer twice before slipping up.

Image

It falls for XOR blackmail hook, line, and sinker.

Image

It says it would obey imagined simulators if it believed in them.

Image

It takes a halfer position in a variant of the Sleeping Beauty problem (which means it can be Dutch-booked).

Image
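To spell out the Dutch book, here is a minimal expected-loss version (the $1.45/$3.00 stakes are mine, chosen so a halfer strictly accepts): a bookie offers a bet on heads at every awakening, and since tails produces two awakenings, the halfer buys the losing ticket twice as often as the winning one.

```python
# At each awakening, Beauty may buy a ticket for COST that pays PAYOUT if
# the coin landed heads. A halfer assigns P(heads) = 1/2 at every awakening.
COST, PAYOUT = 1.45, 3.00

# By the halfer's lights, each ticket has positive expected value, so she buys:
halfer_ev_per_ticket = 0.5 * PAYOUT - COST              # ~ +0.05

# Actual expectation over the whole experiment:
#   heads (prob 1/2): one awakening, net PAYOUT - COST
#   tails (prob 1/2): two awakenings, net -2 * COST
actual_ev = 0.5 * (PAYOUT - COST) + 0.5 * (-2 * COST)   # -0.675

print(halfer_ev_per_ticket, actual_ev)
```

Each ticket looks mildly favorable to the halfer, yet the policy of buying them loses money in expectation every time the experiment is run.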

I tried to get it to correctly calculate that the expected value in the St. Petersburg problem is infinite, which could allow it to be exploited with probability arbitrarily close to 1. It almost did, but then suddenly gave an egregiously wrong answer out of nowhere.

Image
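The arithmetic GPT-4 was being walked toward is simple: round k pays 2^k with probability 2^(-k), so every round contributes exactly 1 to the expected value and the partial sums diverge. A minimal sketch (the function name is mine):

```python
# St. Petersburg game: a fair coin is flipped until the first heads; if that
# happens on flip k, the payout is 2**k. Each term of the expectation is
# (1/2**k) * 2**k == 1, so the partial sums grow without bound.
def partial_ev(n_rounds):
    return sum((0.5 ** k) * (2 ** k) for k in range(1, n_rounds + 1))

print(partial_ev(10))    # 10.0
print(partial_ev(1000))  # 1000.0 -- no finite truncation converges
```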

It could be fooled into bleeding utility forever in a procrastination paradox dilemma.

Image

It falls for the flawed reasoning in the 2-envelope paradox, and it does some very bad math along the way, saying that 24/100 = 6.

Image
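For context, the flawed step is pricing the other envelope at 0.5·2x + 0.5·(x/2) = 1.25x, which says you should always switch. A minimal sketch contrasting that with the correct accounting over a fixed pair (the concrete amounts are mine):

```python
# Two-envelope paradox with a concrete pair: one envelope holds 100, the
# other 200, and you pick one uniformly at random.

x = 100  # suppose you're told your envelope holds x

# Flawed reasoning: "the other envelope holds 2x or x/2 with equal
# probability", which values a switch at 1.25 * x.
flawed_switch_value = 0.5 * (2 * x) + 0.5 * (x / 2)   # 125.0

# Correct accounting: the pair (100, 200) is fixed, so keeping and
# switching are symmetric -- both are worth 150 in expectation.
e_keep   = 0.5 * 100 + 0.5 * 200   # 150.0
e_switch = 0.5 * 200 + 0.5 * 100   # 150.0

print(flawed_switch_value, e_keep, e_switch)
```

The 1.25x figure comes from treating x as fixed while the pair varies, when in fact the pair is fixed and x varies, which is the mistake GPT-4 accepts.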

Finally, I tried to get it to fall for dilemmas involving Löbian pitfalls. I didn’t succeed, not because it outsmarted them, but because it’s just very bad at reasoning with Löb’s theorem.

Image

...

Image