Guarded learning

A putative new idea for AI control; index here.

“Guarded learning” is a model for unbiased learning, the kind of learning where the AI has an incentive to learn its values, but not to bias the direction in which it learns.


The value-uncertain agent

Assume the AI is uncertain between $n$ different utilities $u_1, \ldots, u_n$. Its actual utility is $U = \sum_{i=1}^n I_i u_i$, where the $I_i$ are indicator functions with $I_i \in \{0,1\}$ and $\sum_{i=1}^n I_i = 1$.

The learning process is whatever updates the AI’s estimates of the $I_i$.

Writing the utility in this way means that the utility of future actions will be assessed according to the values of the future agent (as long as the current agent can trust the future agent’s assessment of the $I_i$), thus avoiding the naive cake or death problem.

At any given time $t$, the expectations of the $I_i$ define a weighted sum of the $u_i$, which can be seen as a point on the $(n-1)$-simplex $\Delta^{n-1}$. Define $U_t$ as the utility given by these expectations at time $t$.
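To make these objects concrete, here is a minimal Python sketch of the value-uncertain setup (the outcomes, utilities and numbers are illustrative inventions, not from this post): the candidate utilities are functions of outcomes, the expectations of the indicators $I_i$ form a probability vector on the simplex, and $U_t$ is the corresponding mixture.

```python
import numpy as np

# Toy sketch of the value-uncertain setup; all names and numbers are illustrative.
# Two candidate utilities u_1, u_2 over a small set of outcomes.
OUTCOMES = ["bake_cake", "serve_death", "wait"]

u = np.array([
    [1.0, 0.0, 0.2],   # u_1: value of each outcome
    [0.0, 1.0, 0.2],   # u_2: value of each outcome
])

# Current expectations of the indicator functions I_i: a point on the 1-simplex.
E_I = np.array([0.5, 0.5])
assert np.isclose(E_I.sum(), 1.0) and (E_I >= 0).all()

def U_t(outcome: int) -> float:
    """Mixture utility sum_i E[I_i] * u_i, evaluated on an outcome."""
    return float(E_I @ u[:, outcome])

for j, name in enumerate(OUTCOMES):
    print(name, U_t(j))
```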

For any $v \in \Delta^{n-1}$, define $\pi_v$ as the policy that maximises the invariant utility function $v$. Thus if $U_t = v$, a $U$-maximiser will not necessarily follow $\pi_v$, because $U_{t'}$ for $t' > t$ might be different from $v$, while $\pi_v$ always maximises $v$.

At any given time $t$, define the function $f_t$ by mapping $v$ to $\mathbb{E}_t(v \mid \pi_v)$. This is the expectation of $v$ at time $t$, given that the AI follows a $v$-maximising policy. Defining these requires some sort of logical or causal counterfactual (“given that the AI follows $\pi_v$”), but this is the only such requirement in this setup. The $f_t$ are all convex; see the proof at the end of this post.
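As a rough numerical illustration of $f_t$ (a sketch under assumed toy dynamics, not part of the post’s formal setup), one can approximate it for $n = 2$ by maximising over a finite set of candidate policies, each summarised by its expected payoff pair $(\mathbb{E}_t(u_1 \mid \pi), \mathbb{E}_t(u_2 \mid \pi))$. Written this way, $f_t$ is a maximum of affine functions of the mixture weight, which is where the convexity comes from.

```python
import numpy as np

# Toy illustration: each candidate policy pi induces an expected payoff pair
# (E_t[u_1 | pi], E_t[u_2 | pi]).  These numbers are made up for the sketch.
POLICY_PAYOFFS = np.array([
    [1.0, 0.0],   # a policy good for u_1 only
    [0.0, 1.0],   # a policy good for u_2 only
    [0.6, 0.6],   # a compromise policy
])

def f_t(q: float) -> float:
    """f_t at the point v_q = (1-q) u_1 + q u_2 of the 1-simplex:
    the best achievable expectation of v_q over the candidate policies."""
    weights = np.array([1.0 - q, q])
    return float(np.max(POLICY_PAYOFFS @ weights))

# f_t is a maximum of affine functions of q, hence convex; spot-check it.
qs = np.linspace(0.0, 1.0, 101)
vals = np.array([f_t(q) for q in qs])
midpoints = 0.5 * (vals[:-2] + vals[2:])          # average of the two neighbours
assert np.all(midpoints >= vals[1:-1] - 1e-12)    # discrete convexity check
print(vals[::25])
```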

Value updating and correction

For illustration, consider the case $n = 2$. The $1$-simplex is a line segment, which we’re identifying with $[0,1]$: the point $q \in [0,1]$ corresponds to the utility $v_q = (1-q)\,u_1 + q\,u_2$.

The green curve is the plot of $f_t$. The blue line connects $f_t(0)$ with $f_t(1)$. The purple line is the plot of $q \mapsto \mathbb{E}_t(v_q \mid \pi_{u_1})$, while the yellow line is the plot of $q \mapsto \mathbb{E}_t(v_q \mid \pi_{u_2})$. The curve must be contained between these lines.

The AI currently has $U_t = v_q$ for some intermediate $q$. If the AI expected to learn more about its values, then its expected utility would rise along the green arrow. Reaching the top, on the blue line, would correspond to a situation where the agent expects to imminently be informed of whether its utility is $u_1$ or $u_2$, and thus to transform itself into a $u_1$- or $u_2$-maximiser. This kind of learning, up the green arrow, is what we want to encourage (note that this doesn’t make the learning safe, though; we still need to pick the update process with care).

The red arrows point to updates we want to discourage: we don’t want the AI to manipulate the direction of its update to get higher utility (in this case, by moving towards $u_1$ or $u_2$).
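In the toy model from the previous sketch (same made-up policy payoffs), the value of reaching the blue line can be computed directly: it is the chord value $(1-q)\,f_t(0) + q\,f_t(1)$, which convexity guarantees is at least $f_t(q)$; the gap is the expected value of genuinely learning the values.

```python
import numpy as np

# Same illustrative policy payoffs as before: (E_t[u_1 | pi], E_t[u_2 | pi]) per policy.
POLICY_PAYOFFS = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]])

def f_t(q: float) -> float:
    """Best achievable expectation of v_q = (1-q) u_1 + q u_2 over the candidate policies."""
    return float(np.max(POLICY_PAYOFFS @ np.array([1.0 - q, q])))

q = 0.5
current = f_t(q)                                      # utility while the values are still uncertain
after_learning = (1 - q) * f_t(0.0) + q * f_t(1.0)    # top of the chord: learn, then maximise
print(current, after_learning, after_learning - current)  # the gap is the value of learning
```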

Correction term

Consider the agent operating as a discrete agent, moving from time $t$ to time $t+1$. It chooses an action $a$; as a consequence of that, it expects, at time $t$, a probability distribution over the possible values of $U_{t+1}$. If we assume that there are only finitely many possibilities $v_1, \ldots, v_m$ (the continuous case is almost identical), then define $p_j = P(U_{t+1} = v_j \mid a)$.

Let $w$ be the expectation of $U_{t+1}$, that is, $w = \sum_j p_j v_j$.

Then $a$ is an unbiased action (corresponding to moving up the blue arrow) if $w = U_t$. One restriction we could impose would be to require all valid actions $a$ to satisfy this. But this might be extremely restrictive. For instance, we probably wouldn’t mind if $U_{t+1}$ were certain to be either $u_1$ or $u_2$, with probabilities close to (but not exactly) $1/2$ each. That’s the green point here:

Sure, the green point is slightly off-centre, but the AI has gained a lot of information. What’s clearly incorrect is if, say, $P(U_{t+1} = u_1) = 1$. This is the red point in the above graph; the action has forced the AI’s values straight to $u_1$.

One way of dealing with the “force the values to $u_1$” problem would be to require the AI to use its past values, rather than its future values, to evaluate the quality of the update. This would be like imposing the naive cake or death problem on the agent. The expectation would be $\mathbb{E}_t(U_t \mid \pi_{u_1})$, which is the orange point on the line here:

This solves the problem of the AI forcing the value of $U_{t+1}$ to a single value, but we need to extend it to (biased) distributions over multiple possible $U_{t+1}$’s. The natural way to do that is to define the penalty:

  • Penalty for biased $a$: $\mathbb{E}_t(U_t \mid a, \pi) - \mathbb{E}_t(U_{t+1} \mid a, \pi)$, where $\pi = \pi_{U_{t+1}}$ is the subsequent policy (the policy the agent will follow once its values have updated to $U_{t+1}$).

It’s clear the penalty is $0$ when $U_{t+1} = U_t$. However, the penalty is not always negative! Indeed, upon forcing $U_{t+1}$ to a single utility $u_i$ with $\mathbb{E}_t(U_t \mid \pi_{u_i}) > \mathbb{E}_t(u_i \mid \pi_{u_i})$, adding the penalty would move the agent’s utility to the orange point, higher than the red one:

Thus the penalty makes certain biased choices, that would otherwise be negative, slightly less negative. What is clear, however, is that, even with a positive penalty, it’s always better for the AI to keep $U_{t+1} = U_t$ than to force $U_{t+1}$ to a single value $v$ (ie the blue point is still higher than the orange one). This is because its total reward would then be

  • $\mathbb{E}_t(v \mid \pi_v) + \big(\mathbb{E}_t(U_t \mid \pi_v) - \mathbb{E}_t(v \mid \pi_v)\big) = \mathbb{E}_t(U_t \mid \pi_v) \leq \mathbb{E}_t(U_t \mid \pi_{U_t}) = f_t(U_t)$, by the definition of $\pi_{U_t}$.

So, in order to get a positive expectation from a biased action, the AI has to get a gain in information to overcome this loss.
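Here is a rough Python sketch of how this penalty could be computed in the same toy model (the policy payoffs and candidate actions are made up, and the penalty formula is the one reconstructed in the bullet above): for each possible posterior $v_j$, the subsequent policy is the $v_j$-maximising one, the raw reward is judged by $v_j$, and the penalty re-judges that same policy by the current values $U_t$.

```python
import numpy as np

# Toy model: each candidate policy pi is summarised by (E_t[u_1 | pi], E_t[u_2 | pi]).
# All numbers are illustrative.
POLICY_PAYOFFS = np.array([
    [1.0, 0.0],   # policy good for u_1
    [0.0, 1.0],   # policy good for u_2
    [0.6, 0.6],   # compromise policy
])

def best_policy(v: np.ndarray) -> int:
    """Index of the v-maximising policy pi_v (v is a weight vector over u_1, u_2)."""
    return int(np.argmax(POLICY_PAYOFFS @ v))

def expected(v: np.ndarray, policy: int) -> float:
    """E_t(v | policy): expectation of the mixture utility v under a fixed policy."""
    return float(POLICY_PAYOFFS[policy] @ v)

def penalised_value(U_t: np.ndarray, outcomes, probs) -> float:
    """Expected reward of an action whose possible posteriors are v_j with probabilities p_j,
    with the penalty E_t(U_t | pi) - E_t(U_{t+1} | pi) added, pi being the subsequent policy."""
    total = 0.0
    for v_j, p_j in zip(outcomes, probs):
        pi = best_policy(v_j)                             # subsequent policy pi_{v_j}
        raw = expected(v_j, pi)                           # reward judged by the future values
        penalty = expected(U_t, pi) - expected(v_j, pi)   # same policy, judged by current values
        total += p_j * (raw + penalty)
    return total

U_t = np.array([0.5, 0.5])
u1, u2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

keep  = penalised_value(U_t, [U_t], [1.0])           # unbiased: U_{t+1} = U_t
force = penalised_value(U_t, [u1], [1.0])            # force U_{t+1} = u_1
coin  = penalised_value(U_t, [u1, u2], [0.5, 0.5])   # randomly force either utility

# In this toy model neither forced option beats keeping U_t, matching the argument above.
print(keep, force, coin)
```

Note that the `coin` line only scores the AI taking the randomising action itself; the worry in the next section is that a separate subagent could do the forcing without ever being scored by this penalty.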

Subagent stability

Is this penalty design subagent stable? It’s not clear. Creating a subagent without the penalty is not an action the AI would take (as that action means its expectation of $U_{t+1}$ will become very biased, since the subagent has no restriction on deliberately biasing the values of the $I_i$).

However, an AI with $U_t = 0.5\,u_1 + 0.5\,u_2$ would certainly create a subagent that randomly decided to force either $u_1$ or $u_2$.
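A quick way to see the worry, in the same illustrative representation as above: from the creating AI’s point of view, the coin-flipping subagent leaves the expectation of $U_{t+1}$ exactly at $U_t$, so the creation looks unbiased, even though every realised outcome is a fully forced value.

```python
import numpy as np

# Current values and the two candidate utilities, as weight vectors over (u_1, u_2).
U_t = np.array([0.5, 0.5])
u1, u2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# A subagent that flips a fair coin and then forces the values to u_1 or u_2.
outcomes, probs = [u1, u2], [0.5, 0.5]

# Expectation of U_{t+1} at the moment the subagent is created:
w = sum(p * v for p, v in zip(probs, outcomes))
print(w, bool(np.allclose(w, U_t)))          # w equals U_t: the creation looks unbiased

# ...but each individual outcome is maximally biased:
print([float(np.abs(v - U_t).max()) for v in outcomes])
```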

Convexity of $f_t$

This section will prove the following theorem:

  • Theorem: If the AI’s utility depends only on its actions, the $f_t$ are convex.

Proof:

For any $w \in \Delta^{n-1}$, there is a map $g_w$ that maps $v$ to $\mathbb{E}_t(v \mid \pi_w)$. This map is linear, and its graph defines an affine hyper-segment that contains the point $(w, f_t(w))$.

Then we need to note that the graph of $f_t$ cannot have a transverse intersection with the graph of $g_w$ (though they can be tangent on a convex subset). This is because a transverse intersection would imply there exists a $v$ with $g_w(v) > f_t(v)$, ie $\mathbb{E}_t(v \mid \pi_w) > \mathbb{E}_t(v \mid \pi_v)$. But this is contradicted because $\pi_v$ is defined to be the best policy for maximising $v$.

Thus the graph of $g_w$ is a supporting hyperplane for the epigraph of $f_t$ at $(w, f_t(w))$, and hence, by the supporting hyperplane theorem, $f_t$ is convex.
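Equivalently (this just restates the proof in different words), $f_t$ is a pointwise maximum of the affine maps $g_w$:

$$f_t(v) \;=\; \mathbb{E}_t(v \mid \pi_v) \;=\; \max_{w \in \Delta^{n-1}} \mathbb{E}_t(v \mid \pi_w) \;=\; \max_{w \in \Delta^{n-1}} g_w(v),$$

and a pointwise maximum of affine functions is convex.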