Reward splintering for AI design
This post will look at how model splintering can be used by an AI to extend human-specified rewards beyond its training environment, and beyond the range of what humans could do.
The key points are:
Most descriptive labels (e.g. “happiness”, “human being”) describe collections of correlated features, rather than fundamental concepts.
Model splintering is when the correlated features come apart, so that the label no longer applies so well.
Reward splintering is when the reward itself is defined on labels that are splintering.
We humans deal with these issues ourselves in various ways.
We may be able to program an AI to deal with these in similar ways, using our feedback as needed, and extending beyond when we can no longer provide it with useful feedback.
Section 1 will use happiness as an example, defining it as a bundle of correlated features and seeing what happens when these start to splinter. Section 2 defines model and reward splintering in more formal terms. And finally section 3 will analyse how an AI could detect reward splintering and deal with it.
1. What is happiness? A bundle of correlated features
How might we define happiness today? Well, here’s one definition:
Happiness is that feeling that comes over you when you know life is good and you can’t help but smile. It’s the opposite of sadness. Happiness is a sense of well-being, joy, or contentment. When people are successful, or safe, or lucky, they feel happiness.
We can also include some chemicals in the brain, such as serotonin, dopamine, or oxytocin. There are implicit necessary features there as well, that are taken as trivially true today: happiness is experienced by biological beings, that have a continuity of experience and of identity. Genuine smiles are good indicators of happiness, as is saying “I am happy” in surveys.
So, what is happiness, today? Just like rubes and bleggs, happiness is a label assigned to a bunch of correlated features. You can think of it as similar to the “g factor”, an intelligence measure that is explicitly defined as a correlation of different cognitive task abilities.
And, just like rubes and bleggs, those features need not stay correlated in general environments. We can design or imagine situations where they easily come apart. People with frozen face muscles can’t smile, but can certainly be happy. Would people with anterograde amnesia be truly happy? What about simple algorithms that print out “I am happy”, forever? Well, there it’s a judgement call. Continuity of identity and of consciousness are implicit aspects of happiness; we may decide to keep them or not. We could define the algorithm as “happy”, with “happiness” expanding to cover more situations. Or we could define a new term, “simple algorithmic happiness”, say, that carves out that situation. We can do the same with anterograde amnesia (my personal instincts would be to include anterograde amnesia in a broader definition of happiness, while carving off the algorithm as something else).
Part of the reason to do that is to keep happiness as a useful descriptive term—to carve reality along its natural joints. And as reality gets more complicated or our understanding of it improves, the “natural joints” might change. For example, nationalities are much less well defined than, say, eye colour. But in terms of understanding history and human nature, nationality is a key concept, eye colour much less so. The opposite is true if we’re looking at genetics. So the natural joints of reality are shifting depending on space and time, and also on the subject being addressed.
Descriptive versus prescriptive
The above looks at “happiness” as a descriptive label, and how it might come to be refined or split. There are probably equations for how best to use labelled features in the descriptive sense, connected to the situations the AI is likely to find itself in, its own processing power and cost of computation, how useful it is for it to understand these situations, and so on.
But happiness is not just descriptive, it is often prescriptive (or normative): we would want an AI to increase happiness (among other things). So we attach value or reward labels to different states of affairs.
That makes the process much more tricky. If we say that people with anterograde amnesia don’t have “true happiness”, then we’re not just saying that our description of happiness works better if we split it into two. We’re saying that the “happiness” of those with anterograde amnesia is no longer a target for AI optimisation, i.e. that their happiness can be freely sacrificed to other goals.
There are some things we can do to extend happiness as preference/value/reward across such splintering:
We can look more deeply into why we think “happiness” is important. For instance, we clearly value it as an interior state, so if “smiles” splinter from “internal feeling of happiness”, we should clearly use the second.
We can use our meta-preferences to extend definitions across the splintering. Consistency, respect for the individual, conservatism, simplicity, analogy with other values: these are ways we can extend the definition to new areas.
When our meta-preferences become ambiguous—maybe there are multiple ways we could extend the preferences, depending on how the problem is presented to us—we might accept that multiple extrapolations are possible, and that we should take a conservative mix of them all, and accept that we’ll never “learn” anything more.
We want to program an AI to be able to do that itself, checking in with us initially, but continuing beyond human capacity when we can no longer provide guidance.
2. Examples of model splintering
The AI uses a set F of features that it creates and updates itself. Only one of them is assigned by us: the feature R, the reward function. The AI also updates the probability distribution Q over these features (this defines a generalised model, M=(F,Q)). It aims to maximise the overall reward R.
When talking about M=(F,Q), unless otherwise specified, we’ll refer to whatever generalised model the AI is currently using.
We’ll train the AI with an initial labelled dataset D of situations; the label is the reward value for that situation. Later on, the AI may ask for a clarification (see later sections).
Basic happiness example
An AI is trained to make humans happy. Or, more precisely, it interacts with the humans sequentially, and, after the interaction, the humans click on “I’m happy with that interaction” or “I’m disappointed with that interaction”.
So let fh∈{0,1} be a Boolean that represents how the human clicked, let fπ be the AI’s policy, and, of course, R is the reward. So F={fh,fπ,R}.
In the training data D, the AI generates a policy, or we generate a policy for it, and the human clicks on happiness or disappointment. We then assign a reward of 1 to a click on happiness, and a reward of 0 to a click on disappointment (thus R=fh on D). In this initial training data, the reward and fh are the same.
The reward extends easily to the new domain
Let’s first look at a situation where model changes don’t change the definition of the reward. Here the AI adds another feature fm, which characterises whether the human starts the interaction in a good mood. Then the AI can improve its Q to see how fh and fπ interact with fm; presumably, a good initial mood increases the chances of fh=1.
Now the AI has a better distribution for fh, but no reason to doubt that fh is equivalent to the reward R.
A rewarded feature splinters
Now assume the AI adds another feature to F: fs, which checks whether the human smiles or not during the interaction. Then, re-running the data D while looking for this feature, the AI realises that fs=True is almost perfectly equivalent to R=1.
Here the feature on which the reward depends has splintered: it might be fh that determines the reward, or it might be a (slightly noisy) fs. Or it might be some mixture of the two.
Though the rewarded feature has splintered, the reward itself hasn’t, because fh and fs are so highly correlated: maximising one of them maximises the other, on the data the AI already has.
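To make this concrete, here is a minimal sketch of the check the AI might run on D. Everything in it is an assumption made for illustration: the rows of D are invented, `f_h` and `f_s` stand for the post's fh and fs, and `agreement` is a hypothetical helper rather than part of any proposed design.

```python
# Hypothetical training data D: one row per interaction.
# f_h: 1 if the human clicked "happy"; f_s: 1 if the human smiled.
# All values are invented for illustration.
D = [
    {"f_h": 1, "f_s": 1},
    {"f_h": 1, "f_s": 1},
    {"f_h": 0, "f_s": 0},
    {"f_h": 1, "f_s": 0},  # a happy click without a smile: f_s is slightly noisy
    {"f_h": 0, "f_s": 0},
    {"f_h": 1, "f_s": 1},
]


def reward(row):
    """On D, the assigned reward is exactly the click: R = f_h."""
    return row["f_h"]


def agreement(feature, data):
    """Fraction of rows on which a feature takes the same value as the reward."""
    return sum(row[feature] == reward(row) for row in data) / len(data)


agreement_h = agreement("f_h", D)  # f_h matches R on every row, by construction
agreement_s = agreement("f_s", D)  # f_s matches R on five of the six rows
```

With data like this, either feature would serve as an explanation of the reward, which is why nothing yet forces the AI to choose between them.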
The reward itself splinters
To splinter the reward, the AI has to experience multiple situations where fs and fh are no longer correlated.
For instance, fs can be true without fh if the smiling human leaves before filling out the survey (indeed, smiling is a much more reliable sign of happiness than the artificial measure of filling out a survey). Conversely, bored humans may fill out the survey randomly, giving positive fh without fs.
This is the central example of model splintering:
Multiple features could explain the reward in the training data D. But these features are now known to come apart in more general situations.
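A small sketch of that situation, under stated assumptions: `r_click` and `r_smile` are two hypothetical candidate rewards, the data is invented, and a human who leaves before the survey is treated as having f_h = 0 (the post does not say how a missing click should be encoded).

```python
def r_click(situation):
    """Candidate reward: the human's click determines the reward."""
    return situation["f_h"]


def r_smile(situation):
    """Candidate reward: the human's smile determines the reward."""
    return situation["f_s"]


# Invented training data on which clicks and smiles always coincide.
D = [{"f_h": 1, "f_s": 1}, {"f_h": 0, "f_s": 0}, {"f_h": 1, "f_s": 1}]

# Invented new situations in which the two features come apart.
new_situations = [
    {"f_h": 0, "f_s": 1},  # smiling human leaves before the survey
    {"f_h": 1, "f_s": 0},  # bored human clicks "happy" at random
]


def disagreement(r1, r2, situations):
    """Fraction of situations on which two candidate rewards differ."""
    return sum(r1(s) != r2(s) for s in situations) / len(situations)


on_D = disagreement(r_click, r_smile, D)                 # no disagreement on D
off_D = disagreement(r_click, r_smile, new_situations)   # total disagreement off D
```

The two candidates are indistinguishable on D and completely opposed on the new situations; that gap is the splintering.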
Independent features become non-additive
Another situation with model splintering would be where the reward is defined clearly by two different features, but there is never any tension between them—and then new situations appear where they are in tension.
Let’s imagine that the AI is a police officer and a social worker, and its goal is to bring happiness and enforce the law. So let F={fh,fπ,fm,fs,fl,R} where fl is a feature checking whether the law was enforced when it needs to be.
In the training data D, there are examples with fh being True or False, while fl is undefined (no need to enforce any law). There are also examples with fl being True or False, while fh is undefined (no survey was offered). Whenever the defined feature (fh or fl) was True, the reward R was 1; whenever it was False, R was 0.
Then if the AI finds situations where both of fh and fl are defined, it doesn’t know how to extend the reward, especially if their values contradict each other.
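One way to see the ambiguity is to write down two extensions of the reward that both fit D perfectly. This is only a sketch: `r_and` and `r_or` are hypothetical candidates chosen for illustration, and `None` is an assumed encoding of "undefined".

```python
from itertools import product


def r_and(f_h, f_l):
    """Reward 1 only if every defined feature is True."""
    defined = [v for v in (f_h, f_l) if v is not None]
    return int(all(defined))


def r_or(f_h, f_l):
    """Reward 1 if at least one defined feature is True."""
    defined = [v for v in (f_h, f_l) if v is not None]
    return int(any(defined))


# Training data D: exactly one of the two features is defined in each example.
D = [(True, None), (False, None), (None, True), (None, False)]
labels = [1, 0, 1, 0]

fits_and = all(r_and(*x) == y for x, y in zip(D, labels))
fits_or = all(r_or(*x) == y for x, y in zip(D, labels))

# Situations with both features defined where the two extensions disagree.
conflicts = [
    (h, l)
    for h, l in product([True, False], repeat=2)
    if r_and(h, l) != r_or(h, l)
]
```

Both candidates reproduce every label in D, yet they disagree exactly on the cases where fh and fl contradict each other, so D alone cannot tell the AI which extension was intended.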
Reconvergence of multiple reward features
It’s not automatically the case that as the AI learns more, rewards have to splinter more. Maybe the AI can develop another feature, fa, corresponding to human approval (or approval from its programmers). Then it can see fh and fl as being specific cases of fa: its programmers approve of humans saying they’re happy, and of the law being enforced. In that case, the AI could infer a more general reward that also covers situations where fh and fl are in contradiction with each other.
Changes due to policy choices
We’ve been considering situations where features have become uncorrelated just because the AI has got a broader understanding of the world. But we also need to consider situations where the AI’s own policy starts breaking some of the correlations that otherwise existed.
For example, we could split fl (enforce the law) into fL, a descriptive feature describing the law, and fe, a feature measuring whether that law is enforced.
Then in its training data, fL is fixed (or changes exogenously). But when the AI gets powerful, fL suddenly becomes malleable, dependent on its policy choices. This is another form of model splintering, one that we might want the AI to treat with extra caution and conservatism.
3. Dealing with model splintering
Detecting reward splintering
Reward splintering occurs when there are multiple ways of expressing R, on the labelled data D, and they lead to very different rewards in the world in general.
So we could have multiple reward functions r, all consistent with R over D. We could define a distance function d(r,R) which measures how far apart r and R are on D, and a complexity measure c(r). Then the “goodness of fit” of r to R could be
f(r)=exp(−d(r,R)−c(r)).
Thus reward functions have higher fit if they are close to R on the labelled data D, and are simpler to express. Then define m(r) as the maximum expected value of r (using Q to compute expectations), if the AI were to use an r-maximising policy.
Then the following is always non-negative (the maximum of a weighted sum can never exceed the weighted sum of the individual maxima), and gives a good measure of the divergence between maximising the weighted mix of the r, versus maximising the individual r’s:
∑_r f(r) m(r) − m(∑_r f(r) r).
When that metric hits a certain threshold, the AI knows that significant reward splintering has occurred.
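A toy computation of this metric may help. The sketch below rests on several assumptions that are not in the post: there are only three available policies, each summarised by the expected values of (f_h, f_s) it induces under Q; candidate rewards are linear in those features; and the distances d and complexities c are simply made up.

```python
import math

# Hypothetical policies, each summarised by the expected (f_h, f_s) it induces.
policies = {
    "please_survey": (0.9, 0.3),
    "cause_smiles": (0.2, 0.9),
    "balanced": (0.6, 0.6),
}

# Hypothetical candidates: name -> (linear weights on (f_h, f_s), d(r, R), c(r)).
candidates = {
    "click": ((1.0, 0.0), 0.0, 1.0),
    "smile": ((0.0, 1.0), 0.1, 1.0),
}


def fit(d, c):
    """Goodness of fit f(r) = exp(-d(r, R) - c(r))."""
    return math.exp(-d - c)


def m(weights):
    """Maximum expected value of a linear reward over the available policies."""
    return max(
        sum(w * x for w, x in zip(weights, feats)) for feats in policies.values()
    )


def splintering_metric(cands):
    """Sum_r f(r) m(r) minus m(Sum_r f(r) r)."""
    fits = {name: fit(d, c) for name, (_, d, c) in cands.items()}
    sum_of_maxima = sum(fits[name] * m(w) for name, (w, _, _) in cands.items())
    n_features = len(next(iter(cands.values()))[0])
    mixed = tuple(
        sum(fits[name] * w[i] for name, (w, _, _) in cands.items())
        for i in range(n_features)
    )
    return sum_of_maxima - m(mixed)


metric = splintering_metric(candidates)
```

Here the metric is strictly positive, because no single policy is best for both candidates at once. With a single candidate it is zero, since the mix and the individual reward are maximised by the same policy.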
Dealing with reward splintering: conservatism
One obvious way of dealing with reward splintering is to become conservative about the rewards.
Since human value is fragile, we would initially want to feed the AI with some specific, deliberately over-conservative method of conservatism (such as smooth minimums). After learning more about our preferences, it should learn fragility of value directly, and so could use a more bespoke method of conservatism[1].
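As an illustration of what a smooth minimum could look like, here is a sketch using a log-sum-exp soft minimum. This particular formula, the `sharpness` parameter, and the policy values are all assumptions chosen for the example; the post does not commit to a specific smooth minimum.

```python
import math


def smooth_min(values, sharpness=5.0):
    """Soft minimum: -(1/k) * log(mean(exp(-k * v))).

    Tends to min(values) as k grows and to the mean as k shrinks to 0.
    """
    k = sharpness
    lo = min(values)  # shift by the minimum for numerical stability
    total = sum(math.exp(-k * (v - lo)) for v in values)
    return lo - math.log(total / len(values)) / k


# Hypothetical expected values of two candidate rewards under each policy.
policy_values = {
    "please_survey": (0.9, 0.3),
    "cause_smiles": (0.2, 0.9),
    "balanced": (0.6, 0.6),
}

# A conservative agent picks the policy with the best smooth minimum,
# rather than the best value under any single candidate reward.
best_policy = max(policy_values, key=lambda p: smooth_min(policy_values[p]))
```

In this toy case the conservative choice is the policy that does reasonably well under both candidates, even though each candidate taken alone would prefer a different policy.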
Dealing with reward splintering: asking for advice
The other obvious solution is to ask humans for more reward information, and thus increase the set D on which it has reward information. Ideally, the AI would ask for information that best distinguishes between different reward functions that have high f(r) but that are hard to maximise simultaneously.
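A simplified sketch of that query choice: ask about the situation on which the well-fitting candidates disagree most. The fit scores, the situations, and the use of a fit-weighted variance are all assumptions for illustration; in particular, this ignores the "hard to maximise simultaneously" part and only measures disagreement.

```python
# Hypothetical fit scores f(r) for two candidate reward functions.
fits = {"click": 0.37, "smile": 0.33}

# Hypothetical situations the AI could ask a human to label, with the
# reward that each candidate predicts for that situation.
queries = {
    "clicked happy and smiled": {"click": 1, "smile": 1},
    "clicked happy, did not smile": {"click": 1, "smile": 0},
    "clicked disappointed, did not smile": {"click": 0, "smile": 0},
}


def weighted_disagreement(predictions, fits):
    """Fit-weighted variance of the candidates' predicted rewards."""
    total = sum(fits.values())
    mean = sum(fits[r] * predictions[r] for r in fits) / total
    return sum(fits[r] * (predictions[r] - mean) ** 2 for r in fits) / total


best_query = max(queries, key=lambda q: weighted_disagreement(queries[q], fits))
```

A human label for the chosen situation separates the two candidates, whereas labels for the other two situations would tell the AI nothing it does not already believe.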
When advice starts to fail
Suppose the AI could ask question q1, that would give it labelled data D1. Alternatively, it could ask question q2, that would give it labelled data D2. And suppose that D1 and D2 would imply very different reward functions. Suppose further that the AI could deduce the Di likely to occur from the question qi.
In that case, the AI is getting close to rigging its own learning process, essentially choosing its own reward function and getting humans to sign off on it.
The situation is not yet hopeless. The AI should prefer asking questions that are ‘less manipulative’ and more honest. We could feed it a dataset QA of questions and answers, and label some of them as manipulative and some as not. Then the AI should choose questions that have features that are closer to the non-manipulative ones.
The AI can also update its estimate of manipulation, of QA, by proposing (non-manipulatively—notice the recursion here) some example questions and getting labelled feedback as to whether these were manipulative or not.
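A very rough sketch of "choose questions whose features are closer to the non-manipulative ones" follows. It assumes far more than the post specifies: questions are reduced to two invented numeric features (leadingness and emotional pressure), QA is a tiny made-up dataset, and closeness is measured by distance to class centroids. A real system would need learned features and a much richer notion of manipulation.

```python
import math

# Hypothetical labelled dataset QA: (features, label) per question, where the
# features are (leadingness, emotional_pressure), both invented and in [0, 1].
QA = [
    ((0.1, 0.0), "non-manipulative"),
    ((0.2, 0.1), "non-manipulative"),
    ((0.9, 0.7), "manipulative"),
    ((0.8, 0.9), "manipulative"),
]


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def manipulation_score(features, qa):
    """Distance to the non-manipulative centroid minus distance to the
    manipulative centroid; lower means more like the non-manipulative examples."""
    good = centroid([f for f, label in qa if label == "non-manipulative"])
    bad = centroid([f for f, label in qa if label == "manipulative"])
    return math.dist(features, good) - math.dist(features, bad)


# Hypothetical candidate questions, already mapped to the same feature space.
candidate_questions = {"q1": (0.15, 0.05), "q2": (0.85, 0.60)}

chosen = min(
    candidate_questions,
    key=lambda q: manipulation_score(candidate_questions[q], QA),
)
```

New human labels on proposed questions would simply be appended to QA, which shifts the centroids and so updates the AI's estimate of what counts as manipulative.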
When advice fails
At some point, if the AI continues to grow in power and knowledge, it will reach a point where it can get the feedback Di by asking question qi, and all the qi would count as “non-manipulative” according to the criteria it has in QA. And the problem extends to QA itself: it knows that it can determine future QA and thus future qi and Di.
At this point, there’s nothing that we can teach the AI. Any lesson we could give, it already knows about, or knows it could have gotten the opposite lesson instead. It would use its own judgement to extrapolate R, D, QA, thus defining and completing the process of ‘idealisation’. Much of this would be things it has learnt along the way, but we might want to add an extra layer of conservatism at the end of the process.
[1] Possibly with some meta-meta-considerations that modelling human conservatism is likely to underestimate the required conservatism.