The Achilles Heel Hypothesis for AI
Pitfalls for AI Systems via Decision Theoretic Adversaries
This post accompanies a new paper related to AI alignment. A brief outline and informal discussion of the ideas is presented here, but of course, you should check out the paper for the full thing.
As AI continues to advance at a rapid pace, it is important to know how advanced systems will make choices and in what ways they may fail. When thinking about the prospect of superintelligence, I think it’s all too easy and all too common to imagine that an ASI would be something which humans, by definition, can’t ever outsmart. But I don’t think we should take this for granted. Even if an AI system seems very intelligent, potentially even superintelligent, this doesn’t mean that it’s immune to making egregiously bad decisions when presented with adversarial situations. Thus the main insight of this paper:
The Achilles Heel hypothesis: Being a highly successful goal-oriented agent does not imply a lack of decision-theoretic weaknesses in adversarial situations. Highly intelligent systems can stably possess “Achilles Heels” which cause these vulnerabilities.
More precisely, I define an Achilles Heel as a delusion which is impairing (results in irrational choices in adversarial situations), subtle (doesn’t result in irrational choices in normal situations), implantable (able to be introduced) and stable (remaining in a system reliably over time).
In the paper, eight prime candidate Achilles Heels are considered, alongside ways in which they could be exploited and implanted:
Corrigibility
Evidential decision theory
Causal decision theory
Updateful decision theory
Simulational belief
Sleeping Beauty assumptions
Infinite temporal models
Aversion to the use of subjective priors
This was all inspired by thinking about how, since paradoxes can often stump humans, they might also fool certain AI systems in ways that we should anticipate. The paper surveys and augments work in decision theory involving dilemmas and paradoxes in the context of this hypothesis, and makes a handful of novel contributions involving implantation. My hope is that this will lead to insights on how to better model and build advanced AI. On one hand, Achilles Heels are a possible failure mode which we want to avoid, but on the other, they are an opportunity for building better models via adversarial training or the use of certain Achilles Heels for containment. The paper may also just be a useful reference in general for the topics it surveys.
For more info, you’ll have to read it! Also feel free to contact me at scasper@college.harvard.edu.