Multi-dimensional rewards for AGI interpretability and control

Update August 2021: Re-reading this post, I continue to think this is a good and important idea, and I was very happy to learn after I wrote it that what I had in mind here is really a plausible, viable thing to do, even given the cost and performance requirements that people will demand of our future AGIs. I base that belief on the fact that (I now think) the brain does more-or-less exactly what I talk about here (see my post A model of decision-making in the brain), and on the fact that the machine learning literature has things like this too (see the comments section at the bottom).

~~

(I’m not a reinforcement learning expert—still learning—please call me out in the comments if I’m saying anything stupid, or reinventing wheels. Status: brainstorming.)

It’s not guaranteed, but I strongly expect that a reward signal and value function (a.k.a. reward prediction function) will be an important component of future AGI systems. For example, humans are our one current example of a general intelligence, and every thought we think has a value (reward prediction) in our brain, and we are thinking that thought at least in part because its value is higher than the value of whatever alternative thought we could have thunk instead.

But I think that reward systems as used in the brain (and in today’s model-based RL systems) have room for improvement, in ways that might make it (marginally) less difficult to keep very powerful AGIs under human control. And I have an idea! Before I get to that, I’ll go through the two motivations for this idea—two deficiencies that I see in reward learning systems as they exist in AIs and brains today.

Motivation

Motivation 1: Value functions add essentially nothing to the system’s interpretability

Let’s say I figure out (somehow) that my AGI is currently thinking some metacognitive thought. Is this part of a plan to sabotage my interpretability tools and other control systems? Or is it optimizing its thought processes in a way I would endorse? Don’t expect the value (reward prediction) that the AGI assigned to that thought to give me an answer! In either of these cases, it just says “this thought has high value”. Of course, I can try to get interpretability by looking at other aspects of that thought—what is it connected to in the predictive world model /​ web of knowledge, when has that thought been active in the past, etc. But the value of the thought is entirely useless for interpretability.

Or another example: In rational agents with utility functions, there’s a distinction between final goals and instrumental (sub)goals. In a brain (or brain-like AGI), it’s all mixed up: everything is just “higher value” or “lower value”. Seems like a missed opportunity!

To be clear, I don’t expect that we can build a complete solution to interpretability inside the reward-learning system. Instead, all I’m hoping for is that there’s a way to make that system less than totally useless for interpretability!

Motivation 2: After changing the reward signal, it takes a while for the value function and behavior to “catch up”

This example will be a bit weird but bear with me...

Let’s say an omnipotent alien magically removes “desire to be respected and admired by the people you look up to” from your brain’s reward system, starting now and continuing for the rest of your life. So from this moment on, if you learn that the coolest /​ smartest /​ whatever-est people in the world are very impressed by you, you just feel nothing whatsoever.

What happens? Your behavior would change in response, but I claim it would change very gradually. First, you’re hanging out with your favorite people, and you make a joke, and the joke lands perfectly. Everybody laughs! But instead of feeling great and patting yourself on the back, you just watch them laugh and feel nothing at all inside. Repeat a few times, and you’ll eventually just stop trying to make jokes, and maybe eventually stop wanting to hang out with them in the first place.

That’s a pretty direct consequence of the reward function change, but other consequences would be less direct, and the corresponding behavioral updates would take longer to play out. Like, maybe you always watch football. You’ve always thought of yourself as a football fan, and maybe you never even consciously realized that you had developed your self-image as a football fan over the years entirely because it has helped you fit in with your friends. After a while, fitting in no longer feels motivating, and then having a self-image as a football fan no longer feels motivating, and then finally you stop watching football.

So just like in those examples, over days and weeks and months, a million little learned habits and quirks fade away, because they all turn out to have been feeding very indirectly off that social reward signal—the one that the alien has now magically deleted.

Metaphorically speaking, it’s like that social reward signal is a river, and your brain’s credit assignment mechanism splits up that river into countless tributaries and streams and creeks and brooks, and each of those tiny brooks creates the fertile soil that supports its own little wildlife ecosystem ( = habits and preferences that you’ve developed in some obscure context). When the river stops flowing, it stops feeding the tributaries, which in turn stop feeding the creeks and brooks, etc. But this process unfolds gradually.

If you prefer RL textbook examples to brain examples, think of how TD learning walks back the credit assignment one timestep at a time. We change the reward function after episode N. In episode N+1, the value function predicts the old reward. In episode N+2, the value function predicts the old reward right up until the action right before the reward signal. In episode N+3 it’s the old reward until the last two steps, etc.
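To make that concrete, here is a tiny tabular TD(0) sketch in Python (the chain length, learning rate, and reward values are arbitrary illustration choices, not anything specific to AGI): after the terminal reward changes, the value estimates only walk back toward the new reward roughly one state per episode.

```python
# A toy 5-state chain: state 0 -> 1 -> ... -> 4, with a reward only at the end.
import numpy as np

N_STATES, ALPHA, GAMMA = 5, 0.5, 1.0
V = np.zeros(N_STATES + 1)  # V[N_STATES] is the terminal state, fixed at 0

def run_episode(V, final_reward):
    """Walk down the chain once, applying the TD(0) update at each step."""
    for s in range(N_STATES):
        r = final_reward if s == N_STATES - 1 else 0.0
        V[s] += ALPHA * (r + GAMMA * V[s + 1] - V[s])

# Train to convergence on the old reward (+1 at the end of the chain)...
for _ in range(100):
    run_episode(V, final_reward=1.0)
print("values under old reward:", np.round(V[:N_STATES], 2))

# ...then switch the reward to 0 and watch the values adjust only gradually,
# walking back from the end of the chain about one state per episode.
for episode in range(5):
    run_episode(V, final_reward=0.0)
    print(f"after {episode + 1} episode(s) of new reward:", np.round(V[:N_STATES], 2))
```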

OK, now we want to control an AGI, and let’s say we are successfully controlling it (“steering it”), but we have just decided that some aspect of our reward signal is not quite right, and we want to change it. It is not acceptable to re-train the AGI from scratch—I would be surprised if early AGIs could be trained in less than a year of wall-clock time! What we can do is change the reward signal, and let the AGI’s learned value function gradually adjust itself to the new reward. However, as discussed above, this adjustment could take a long time! And if we made the change for safety reasons, do we really want to wait patiently as the AGI’s motivations gradually change from problematic to better, hoping that nothing goes wrong in the meantime?? I don’t!

When you’re driving a car, it is a critically important safety requirement that when you turn the steering wheel, the wheels respond instantaneously. By the same token, I expect that it will be a critically important safety requirement to be able to change an AGI’s motivations instantaneously when you press the appropriate button.

So those are my two motivations. I think my idea below could help with both of them—not solving the problem, but helping on the margin. Before I introduce the idea, we need two more points of background.

Background

Background 1: We should expect reward functions to usually be a sum of multiple, meaningfully different components

I’ve always been assuming that if we make AGIs with reward signals, we want the reward signal to be a sum of lots of components which are meaningfully different from our (the programmer’s) perspective.

A human example would be that the RL part of our brain gets a reward for eating when we’re hungry, and it gets reward for learning that someone we respect likes us back. Those two are different, and as far as I can tell the brain just adds them up (along with dozens of other contributions) to calculate total reward.

In the AGI case, I imagine that early on we would have some educational / reward-shaping contributions to total reward, for example a reward for correctly predicting human speech sounds. Then eventually (and maybe also early on) we would also be setting up things like verbal commands (“I, the programmer, say something while pressing a button, and whatever thoughts that sentence activates in the AGI are now imbued with positive reward”, or something vaguely like that), or other forms of human feedback and guidance. These feedback and guidance rewards would presumably also have multiple sub-components—like “reward for following the command I issued yesterday” is meaningfully different from “reward for following the command I issued last week”. Again, from an interpretability perspective, it would be valuable to know if an AGI rates a thought or action highly because it expects that thought or action to help fulfill yesterday’s command, or last week’s command, or both.
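For concreteness, the kind of bookkeeping I have in mind is just this (the component names below are made up for illustration, not a real proposed reward function):

```python
# Made-up component names, just to illustrate "total reward = sum of
# meaningfully different components"; this is not a real proposed reward function.
reward_components = {
    "speech_prediction_shaping": 0.2,  # educational / reward-shaping term
    "command_from_yesterday": 0.0,     # reward tied to following yesterday's command
    "command_from_last_week": 0.7,     # reward tied to following last week's command
}
total_reward = sum(reward_components.values())
```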

Background 2: Why do rewards need to be 1D (a.k.a. scalar)?

(Well, sometimes a 1D probability distribution, but let’s gloss over that; it doesn’t affect this discussion.)

As far as I understand right now, there are two reasons that rewards and values need to be scalar. Going back to the brain as an example:

Value comparisons require a scalar: In the brain, we roll out multiple possible thoughts /​ actions /​ plans, and then we need to do a comparison to decide which is better. You need a scalar to enable that comparison.

Policy learning requires a scalar: Since there are an astronomically large number of possible thoughts / actions / plans, we want to make sure that unpromising options don’t even rise to the level of being considered in those value comparisons above. Since “likelihood of a thought rising to consideration” is a scalar, the reward needs to be a scalar too.

More on the distinction between these two (a toy sketch in code follows this list):

  • In the “babble and prune” folk psychology model, the first one is basically using reward information to improve pruning, and the second one is basically using reward information to improve babbling.

  • In AlphaZero, the first one is training the value head, and the second one is training the policy head, if I understand correctly.

  • In the “phasic dopamine = TD learning” model, I would say (tentatively) that the first one is why there are dopamine receptors in the basal ganglia, and the second one is why there are dopamine receptors in the neocortex.

    • More details on this one, albeit very speculative—I’m still reading up on it. In my current understanding, the basal ganglia computes the value function—a reward prediction for any possible thought. Then it combines that with the reward signal to calculate a reward prediction error, and then it updates its own value function by TD learning, and last but not least, it modulates the activity patterns in the neocortex to enhance higher-value thoughts at the expense of lower-value thoughts. That’s all the value part. The policy part is: the neocortex listens for those reward prediction errors mentioned above, and when they’re positive, it strengthens whatever connections are active, making them likelier to recur in the future. And conversely, when the reward prediction errors are negative, it weakens the currently-active connections.
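Setting the neuroscience aside, here is a toy scalar sketch of the two roles: a learned value estimate used to compare whichever candidates come up, and a learned policy that determines which candidates come up in the first place. Everything in it (the 10-option bandit setup, the learning rates, the 3 candidates per step) is made up for illustration.

```python
# A toy, non-biological sketch of the two distinct uses of a scalar reward signal.
import numpy as np

rng = np.random.default_rng(0)
N_OPTIONS = 10
values = np.zeros(N_OPTIONS)        # learned value estimate per option (for comparisons)
propensities = np.zeros(N_OPTIONS)  # policy logits: how likely each option is to be proposed
true_reward = rng.normal(size=N_OPTIONS)  # hidden reward for each option

for step in range(1000):
    # Policy learning determines which few options even get considered ("babble")...
    probs = np.exp(propensities) / np.exp(propensities).sum()
    candidates = rng.choice(N_OPTIONS, size=3, p=probs)

    # ...and a scalar value comparison picks the best of those candidates ("prune").
    chosen = candidates[np.argmax(values[candidates])]

    # A positive reward prediction error strengthens both the value estimate and
    # the tendency to propose this option again; a negative one weakens them.
    rpe = true_reward[chosen] - values[chosen]
    values[chosen] += 0.1 * rpe
    propensities[chosen] += 0.1 * rpe
```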

So, again, both of the things we do with reward require the reward to be scalar. Or do they? With that, we finally get to my proposal.

Proposal: Multi-dimensional reward, multi-dimensional value function, no change to the policy learning

As above, let’s assume that the reward signal is a sum of many components which are meaningfully different from our (human AI programmer) perspective. Rather than adding up the components into a total reward outside of the RL system, we feed the RL system all the components separately. Then the RL system constructs a set of value functions, one predicting each component independently. In deep RL, this would look like a bunch of value outputs instead of just one. In a brain, I think the value calculation is like 99% just a memorized lookup table and 1% calculation (long story, highly speculative, I won’t get into it here), and in that case we just add an extra column to the lookup table for each of the reward components.

We update the value function with TD learning, as usual. (The TD learning algorithm vectorizes just fine—you have a vector of reward components, and a vector of old values, and a vector of new values. No problem.)

When we need to do value comparisons, we just add up the components to get total value. So the result of the value comparisons is the same as normal. Only the internal representation is different—i.e., more detailed.

Meanwhile, the policy learning is totally unchanged—same as normal. Assuming a brain-like architecture, we would add up all the components of reward-prediction-error in the basal ganglia part of the system to get a total reward prediction error, then that total is used to strengthen or weaken the active connections in the neocortex part of the system.
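Here is a minimal tabular sketch of those three paragraphs put together. All the specific numbers and function names are mine, purely for illustration; the point is just the shapes: the value table gets one column per reward component, the TD update runs componentwise, the value comparison sums the components, and the policy side only ever sees the summed reward prediction error.

```python
import numpy as np

K = 3                        # number of reward components (made-up)
N_THOUGHTS = 100             # number of discrete "thoughts" (made-up)
ALPHA, GAMMA = 0.1, 0.9
values = np.zeros((N_THOUGHTS, K))  # per-thought, per-component value estimates

def td_update(values, thought, next_thought, reward_vec):
    """Vectorized TD(0): one update per reward component, all at once."""
    td_error_vec = reward_vec + GAMMA * values[next_thought] - values[thought]
    values[thought] += ALPHA * td_error_vec
    return td_error_vec

def pick_best(values, candidate_thoughts):
    """Value comparison: sum the components into a total value, pick the best candidate."""
    totals = values[candidate_thoughts].sum(axis=1)
    return candidate_thoughts[int(np.argmax(totals))]

def rpe_for_policy_learning(td_error_vec):
    """The policy-learning side is unchanged: it only ever sees the summed prediction error."""
    return float(td_error_vec.sum())
```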

So far, everything here has created no input-output difference whatsoever in the system.

...So why do it? Just the two reasons at the top:

  • Interpretability: If a thought /​ action has high value, we can see which of the reward components was the ultimate source of that value—no matter how indirect the path from that thought /​ action to the associated reward.

  • Control: If we decide that we were wrong to include one of the reward components, we can just start leaving out that component when we do the value calculations. And then the AI will immediately start doing all its value calculations as if that component of the reward signal had never existed in the first place!

    • More generally, we can put different weights on each component (take the dot product of the value vector with whatever fixed vector we choose), and alter those weights whenever we want.

    • Even more generally, we can apply a nonlinear function to the components. For example, the “minimum” seems potentially useful: “only do something if all the various components of the reward function endorse it as helpful”. (See the little sketch just below this list.)
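Concretely, the collapse-to-scalar step becomes a knob we can adjust at any time, with nothing needing to be relearned. A little sketch (the particular weightings here are arbitrary examples, not recommendations):

```python
import numpy as np

def total_value(value_vec, weights=None, use_min=False):
    """Collapse a per-component value vector into the scalar used for comparisons."""
    if use_min:
        # "Only rate a thought highly if every reward component endorses it."
        return value_vec.min()
    if weights is not None:
        # E.g. zero out one component's weight to "delete" that motivation right away,
        # without waiting for anything to be relearned.
        return value_vec @ weights
    return value_vec.sum()

v = np.array([2.0, -1.0, 0.5])                             # one thought's value, per component
print(total_value(v))                                      # plain sum: 1.5
print(total_value(v, weights=np.array([1.0, 0.0, 1.0])))   # ignore component 2: 2.5
print(total_value(v, use_min=True))                        # minimum: -1.0
```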

For the control aspect: when we change the reward and value function, we don’t immediately change the policy part, as mentioned above. The policy part will eventually adapt itself to the modified reward, but not right away. Does that matter for safety?

Let’s go back to that human example above, where your reward system is suddenly, magically modified to no longer give any reward for “being well-regarded by people you look up to”. If you had instant value-editing but not instant policy-editing, as proposed here, then it would still occur to you to do the things that were previously rewarded, like cracking jokes or watching football (in that example). Maybe sometimes you would do those things without even thinking, by force of habit. But as soon as you so much as entertain the thought of doing those things, you would find it to be an unappealing thought, and you wouldn’t do it.

So I think this plan, with multi-dimensional value but no change to the policy part, is probably sufficient to serve as an effective ingredient of an AGI control system. AGIs are only really dangerous when they’re coming up with new, clever ideas that we programmers didn’t think of, and that kind of planning and brainstorming would almost definitely involve making queries to the value function, I believe (at least in a brain-like architecture).

Is there a cost or performance penalty to doing this rather than using a scalar reward function? Yes, of course, but I really think it would be almost negligible—like a few percent. Or maybe there could even be a performance benefit in certain (deep-RL-type) cases.

Like, having 20 value outputs instead of just one on a deep-RL neural network certainly doesn’t increase the memory or compute requirements much—at least not with the neural network architectures people usually use these days, where the connections from the final layer to the output are a very small fraction of the total number of connections. And what about fitting accuracy? Well, if anything I would expect the fitting accuracy to improve, because we’ve helped out the network by splitting up things that are semantically unrelated. It’s like we’re giving the network a big hint about how the reward signal works.
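As a back-of-the-envelope check on that memory/compute claim, here is the parameter count for a completely made-up MLP value network (the layer sizes are arbitrary assumptions; this is not any particular system):

```python
# Made-up MLP value network: hidden layers shared by all heads, plus a final
# linear layer per value output. (Biases ignored; sizes are arbitrary.)
trunk_sizes = [512, 256, 256, 128]
trunk_params = sum(a * b for a, b in zip(trunk_sizes[:-1], trunk_sizes[1:]))  # ~230k

one_head_params = trunk_sizes[-1] * 1       # final layer -> 1 scalar value output
twenty_head_params = trunk_sizes[-1] * 20   # final layer -> 20 value outputs

print(trunk_params, one_head_params, twenty_head_params)
# The 19 extra heads add ~2.4k parameters on top of ~230k shared ones,
# i.e. on the order of a 1% increase for this (made-up) architecture.
```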

Or if we’re thinking of brain-like systems, if I’m correct that the basal ganglia is responsible for storing and calculating the value function—then, well, it seems like the relevant part of the basal ganglia takes up like 100× less volume than the neocortex, or something like that. The brain’s value function (as I understand it) is just not that complicated—not compared to all the other things that the neocortex subsystem has to do, like creating and editing a giant predictive world-model, using it for inference, updating it by self-supervised learning, etc. So learning a 20-dimensional value function instead of a 1-dimensional value function is just not much extra computational cost as a fraction of the whole system, as far as I understand it.

(Update: I now think the human brain does more-or-less exactly this; see A model of decision-making in the brain (the short version).)

Why not put in multi-dimensional policy-learning too?

Well, sure, you can, but now it’s not a few-percent computational cost penalty but a factor-of-several computational cost penalty, I would think. Like, in the brain-like case, you gradually learn and cache sequences of thoughts and actions that are rewarding, and build on them, and continually prune out less-rewarding alternatives. Different reward functions would thus seem to essentially require building multiple neocortexes in parallel—and with no obvious way to interpolate between them. Or in the deep-RL case, think of how policy learning needs to be on-policy; you can’t be on-policy with respect to several different reward functions at once.

Anyway, as described above, I don’t see much (if any) additional safety benefit to going down this path—let alone enough extra benefit to justify a factor-of-several computational cost penalty.

Previous literature

I did an extremely cursory search for people who have already looked into this kind of system, discussed it, or (better yet) tested whether it would actually work. No such luck! I did find some papers involving learning from multiple reward signals (e.g. 1, 2, 3), but they all seem only superficially related to what I have in mind, and not directly applicable to AGI safety in the way I’m hoping this blog post is.

I’m guessing that this proposal is really mainly applicable to AGI safety, and not all that terribly useful for RL as it is typically practiced at universities and companies today. Either that, or else this proposal is stupid, or else I haven’t found the right corner of the literature yet.

Discussion prompts

Did you understand all that? What are you confused about? What am I confused about? Would it work at all? Would it help? What could go wrong? How can we make it better? Please discuss! :-)