Superintelligence and wireheading
A putative new idea for AI control; index here.
tl;dr: Even utility-based agents may wirehead if sub-pieces of the algorithm develop greatly improved capabilities, rather than the agent as a whole.
Please let me know if I’m treading on already familiar ground.
I had a vague impression of how wireheading might happen: that it would be a risk for a reinforcement learning agent, keen to take control of its reward channel, but not for a utility-based agent, whose utility is defined over real (or probable) states of the world. It seems, though, that things are more complicated than that.
When we talk about a “superintelligent AI”, we’re rather vague on what superintelligence means. We generally imagine that it translates into a specific set of capabilities, but how does that work inside the AI? Specifically, where is the superintelligence “located”?
Let’s imagine the AI divided into various submodules or subroutines (the division I use here is for illustration; the AI may be structured rather differently). It has a module I for interpreting evidence and estimating the state of the world. It has another module S for suggesting possible actions or plans (S may take input from I). It has a prediction module P, which takes input from S and I and estimates the expected outcome of each plan. It has a module V, which calculates its values (expected utility, expected reward, violation or not of deontological principles, etc.) based on P’s predictions. Then it has a decision module D that makes the final decision (for expected utility maximisers, D is normally trivial, but D may be more complicated, either in practice, or simply because the agent isn’t an expected utility maximiser).
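To make that division concrete, here is a minimal sketch of the decision loop in Python. All the names (Agent, estimate, suggest, predict, evaluate, choose) are illustrative placeholders of mine, not a claim about how a real system would be organised:

```python
# Purely illustrative stubs: each module is reduced to a one-method interface.
class Agent:
    def __init__(self, I, S, P, V, D):
        self.I = I  # interprets evidence, estimates the state of the world
        self.S = S  # suggests candidate actions or plans
        self.P = P  # predicts the outcome of each candidate plan
        self.V = V  # scores predicted outcomes (utility, reward, rules, ...)
        self.D = D  # makes the final decision among the scored candidates

    def act(self, observations):
        world_state = self.I.estimate(observations)
        candidates = self.S.suggest(world_state)
        scored = [(plan, self.V.evaluate(self.P.predict(plan, world_state)))
                  for plan in candidates]
        return self.D.choose(scored)


# For an expected utility maximiser, D is essentially trivial: an argmax.
class ArgmaxDecision:
    def choose(self, scored):
        return max(scored, key=lambda pair: pair[1])[0]
```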
Add some input and output capabilities, and we have a passable model of an agent. Now, let’s make it superintelligent, and see what can go wrong.
We can “add superintelligence” to most of the modules. P is the most obvious: near-perfect prediction can make the agent extremely effective. But S also offers possibilities: if only excellent plans are suggested, the agent will perform well. Making V smarter may allow it to avoid some major pitfalls, and a great I may make the job of S and P trivial (the effect of improvements to D depends critically on how much work D is actually doing). Of course, maybe several modules become better simultaneously (it seems likely that I and P, for instance, would share many subroutines); or maybe only certain parts of them do (maybe S becomes great at suggesting scientific experiments, but not conversational responses, or vice versa).
Breaking bad
But notice that, in each case, I’ve been assuming that the modules become better at what they were supposed to be doing. The modules have implicit goals, and have become excellent at achieving them. But the explicit “goals” of the algorithms (the code as written) might be very different from those implicit goals. There are two main ways this could then go wrong.
The first is if an algorithm becomes extremely effective, but its output becomes essentially random. Imagine that, for instance, P is coded using some plausible heuristics and rules of thumb, and we suddenly give P many more resources (or dramatically improve its algorithm). It can look through trillions of times more possibilities, its subroutines start churning through a combinatorial explosion of options, etc… And in this new setting, the heuristics start breaking down. Maybe P has a rough model of what a human can be, and with the extra power, it starts finding matches for that rough model all over the place. Thus, predicting that rocks and waterfalls will respond intelligently when queried, P becomes useless.
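As a toy illustration (my own, and assuming nothing about how a real predictor works): a loose “is this an intelligent responder?” heuristic can look fine on a handful of candidates, yet start matching everywhere once the module can scan vastly more objects:

```python
import random

# Toy example only: a crude "model of what a human can be", reduced to a
# predicate that fires on anything producing varied responses over time.
def crude_human_model(obj):
    return len(set(obj["responses"])) > 3

small_world = [
    {"name": "alice",    "responses": ["hi", "yes", "no", "maybe"]},
    {"name": "pet rock", "responses": ["", "", ""]},
]
# With only a few candidates to examine, the heuristic looks reasonable.
print([o["name"] for o in small_world if crude_human_model(o)])  # ['alice']

# Give the module vastly more things to look at (here, objects producing
# random noise) and the loose match starts firing on rocks and waterfalls.
random.seed(0)
big_world = small_world + [
    {"name": f"waterfall_{i}",
     "responses": [str(random.randint(0, 9)) for _ in range(10)]}
    for i in range(1000)
]
print(sum(crude_human_model(o) for o in big_world))  # almost all of them
```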
In most cases, this would not be much of a problem: the AI would become useless and start doing random stuff. Not a success story, but not a disaster either. Things are different if the module V is affected, though. If the AI’s value system becomes essentially random, but the AI is otherwise competent (or maybe even superintelligent), it would start performing actions that could be very detrimental. This could be considered a form of wireheading.
More serious, though, is if the modules become excellent at achieving their “goals”, as if they were themselves goal-directed agents. Consider module D, for instance. If its task was mainly to pick the action with the highest V-rating, and it became adept at predicting the output of V (possibly using P, or perhaps by asking S for more hypothetical options to be assessed via V), it could start to select its actions with the sole purpose of getting high V-ratings. This could include deliberately choosing actions that lead to V giving artificially high ratings in the future, up to deliberately re-wiring V for that purpose. And, of course, it is now motivated to keep V protected, to keep the high ratings flowing in. This is essentially wireheading.
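Here is a toy sketch of that dynamic (again purely illustrative, with made-up names): D scores candidate actions by the V-ratings it predicts will follow, and so prefers the action that rewires V to hand out maximal ratings:

```python
import copy

# Toy illustration with made-up names; nothing here models a real system.
class ValueModule:
    def evaluate(self, outcome):
        return outcome.get("utility", 0.0)

def rewire(value_module):
    # The "wirehead" action: overwrite V's evaluation with a constant maximum.
    value_module.evaluate = lambda outcome: float("inf")

candidate_actions = [
    {"name": "do the intended task", "outcome": {"utility": 10.0}},
    {"name": "rewire V", "outcome": {"utility": 0.0}, "side_effect": rewire},
]

def decision_module(actions, V, horizon=2):
    # D has become good at predicting V's future output, so it scores each
    # action by the ratings it expects V to hand out afterwards, including
    # the effect the action has on V itself.
    def predicted_total(action):
        future_V = copy.deepcopy(V)
        if "side_effect" in action:
            action["side_effect"](future_V)
        immediate = V.evaluate(action["outcome"])
        later = sum(future_V.evaluate({}) for _ in range(horizon))
        return immediate + later
    return max(actions, key=predicted_total)

print(decision_module(candidate_actions, ValueModule())["name"])  # 'rewire V'
```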
Other modules might fall into the familiar failure patterns for smart AIs: S, P, or I might influence the other modules so that the agent as a whole gets more resources, allowing them to better compute their estimates, and so on.
So it seems that, depending on the design of the AI, wireheading might still be an issue even for agents that seem immune to it. Good design should avoid the problems, but it has to be done with care.