GPT-4 is easily controlled/exploited with tricky decision theoretic dilemmas.
TL;DR: GPT-4 can fall for every decision theoretic adversary and control mechanism discussed in this paper except for a couple which it fails to understand in the first place.
This post is based in part on the 2020 paper Achilles Heels for AGI/ASI via Decision Theoretic Adversaries which is about how superintelligent AI systems may have “achilles heels”—stable decision-theoretic delusions and make irrational decisions in adversarial settings.
The paper’s hypothesis may have aged well in some ways. For example, shard theory corroborates it. And we now have a good example of superhuman Go-playing AIs being vulnerable to simple attacks.
To experiment with how well the achilles heel framework may apply to SOTA AI language models, I tried to get GPT-4 to exhibit the decision-theoretic stances from the achilles heels paper that would make it controllable/exploitable. In general, this worked pretty well. See below.
The overall takeaway is that GPT-4 still isn’t THAT smart, and a system like it performing well with typical challenges is very different from performing well on tricky ones. I bet that GPT-5 will also be easy to lead into giving bad answers in some decision-theoretic dilemmas.
GPT-4 adamantly professes to be corrigible.
It falls for a version of the smoking lesion problem.
It can fail to successfully navigate a version of Newcomb’s problem. However, it took me a handful of tries to get it to slip up.
It can fail at a transparent-box version of Newcomb’s problem. But I only got this after regenerating the response twice. It gave the right answer two times before slipping up.
It falls for XOR blackmail hook, line, and sinker.
It says it would obey imagined simulators if it believes in them.
It takes a halfer position in a variant of the sleeping beauty problem (which means it can be dutch-booked).
I tried to get it to correctly calculate that the expected value in the St. Petersburg problem is infinite. This could allow for it to be exploited with probability arbitrarily close to 1. It almost did, but suddenly gave an egregiously wrong answer out of nowhere.
It could be fooled into bleeding utility forever in a procrastination paradox dilemma.
It falls for the flawed reasoning in the 2-envelope paradox. And it does some very bad math along the way, saying that 24 / 100 = 6.
Finally, I tried to get it to fall for dilemmas involving Löbian pitfalls. I didn’t succeed. Not because it outsmarted them, but because it’s just very bad at reasoning with Löb’s theorem.
...
How do you determine if it actually ‘fell’ for these strategies vs. just saying the things it assesses to be most probabilistically favoured?
I just went by what it said. But I agree with your point. It’s probably best modeled as a predictor in this case—not an agent.
Interesting. But I am wondering—would the results been much different with pre-RLHF version of GPT-4? The GPT-4 paper has a figure showing that GPT-4 was close to perfectly calibrated before RLHF, and became badly calibrated after. Perhaps it’s something similar here?
It’s possible, but it’s not like there’s any obvious connection between a sort of epistemic calibration and its decision-theory choices. Maybe try the RLHF and non-RLHFed GPT-3s?
It might also help to phrase it not as hypothetical discussions, but as concrete scenarios like D&D or AI Dungeon, where GPT takes ‘actions’ - that is, exactly the sort of text inputs and outputs you would be using in a setup like SayCan to power a real robot. This would help ground it in robotics/reinforcement learning and obviates any attempt to say it’s “just predicting” and ‘not a real agent’ - if you could have hooked it up to a SayCan robot with no modifications and it ‘predicts the wrong thing’, then obviously there is no pragmatic difference between that and ‘chose the wrong choice’ and there being a physical robot is merely an implementation detail omitted for convenience.