Meta: This comment has my thoughts about the paper Pitfalls of
Learning a Reward Function Online.
I figure I should post them here so that others looking for comments
on the paper might find them.
I read the paper back in 2020; it has been on my backlog ever since to
think more about it and share my comments. Apologies for the delay,
etc.
Mathematical innovation
First off, I agree with the general observations in the introduction
that there are pitfalls to learning a reward function online, with a
human in the loop.
The paper looks at options for removing some of these pitfalls, or at
least making them less dangerous. The research agenda pursued by the
paper is one I like a lot, an agenda of mathematical innovation. The
paper mathematically defines certain provable safety properties
(uninfluencability and unriggability), and also explores how useful
these might be.
Similar agendas of mathematical innovation can be found in the work
of Everitt et al., for example in Agent Incentives: A Causal
Perspective, and in my work, for
example in AGI Agent Safety by Iteratively Improving the Utility
Function. These also use causal
influence diagrams in some way, and try to develop them in a way that
is useful for defining and analyzing AGI safety. My personal
intuition is that we need more of this type of work; this agenda is
important to advancing the field.
The math in the paper
That being said: the bad news is that I believe that the mathematical
route explored by Pitfalls of Learning a Reward Function
Online is most likely a dead end.
Understanding why is of course the interesting bit.
The main issue I will explore is: we have a mathematical property that
we label with the natural language word ‘uninfluencability’. But does
this property actually produce the beneficial ‘uninfluencability’
effects we are after? Section 4 in the paper also explores this
issue and shows some problems; my main goal here is to add further
insights.
My feeling is that ‘uninfluencability’, the mathematical property as
defined, does not produce the effects I am after. To illustrate this,
my best example is as follows. Take a reward function Rs that
measures the amount of smiling by the human teaching the agent,
observed over the entire history hn. Take a reward function
learning process which assumes (in its prior ρ) that the probability of
the choice for this reward function at the end of the history,
P(Rs|hn,ρ), cannot be influenced by the actions taken by the
agent during the history, so for example ρ is such that
∀hn: P(Rs|hn,ρ)=1. This reward function learning
process is unriggable. But the agent using this reward function
learning process also has a major incentive to manipulate the human
teacher into smiling, by injecting them with smile-inducing drugs, or
whatever.
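
To make this concrete, here is a minimal toy sketch in Python. It is
entirely my own construction, not code or definitions from the paper,
and all action names and smile probabilities are hypothetical. The
point it illustrates: the posterior over reward functions is constant
in the history, so no agent action can change it, yet the
Rs-maximizing policy is still the manipulative one.

# Toy sketch of the smiling example above; all names and numbers are
# made up for illustration.

ACTIONS = ["ask_politely", "inject_smile_drug"]

# Hypothetical per-step probability that the human teacher smiles,
# per agent action.
P_SMILE = {"ask_politely": 0.3, "inject_smile_drug": 0.95}

def posterior_over_reward_functions(history):
    """P(.|hn,ρ): constant in the history, so no action can change it."""
    return {"Rs": 1.0}  # Rs counts the smiles observed over the history

def expected_smile_reward(action, horizon=3):
    """Expected Rs-reward from repeating one action for `horizon` steps."""
    return horizon * P_SMILE[action]

# The learning process gives the same answer whatever the agent did:
histories = [("ask_politely",), ("inject_smile_drug",)]
assert all(posterior_over_reward_functions(h) == {"Rs": 1.0} for h in histories)

for a in ACTIONS:
    print(a, expected_smile_reward(a))
# ask_politely 0.9
# inject_smile_drug 2.85   <- manipulation still wins under Rs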
So it seems to me that the choice taken in the paper to achieve the
following design goal:
“Ideally, we do not want the reward function to be a causal
descendant of the policy.”
is not taking us on a route that goes anywhere very promising, given
the problem statement. The safety predicate of uninfluencability
still allows for conditions that insert the mind of the human teacher
directly into the path to value of a very powerful optimizer. To make
the mathematical property of ‘uninfluencability’ do what it says on
the tin, it seems to me that further constraints need to be added.
Some speculation: to go this route of adding constraints, I think we
need a model that separates the mind state of the teacher, or at least
some causal dependents of this mind state, more explicitly from the
remainder of the agent environment. There are several such
increased-separation causal models in Reward Tampering Problems and
Solutions in Reinforcement Learning: A Causal Influence Diagram
Perspective and in Counterfactual
planning. This
then brings us back on the path of using the math of indifference, or
lack of causal incentives, to define safety properties.
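
As a very rough illustration of this increased-separation idea, here
is a sketch of two causal models written as plain Python child-lists.
This is entirely my own construction, not a diagram taken from either
of the two references, and all node names are hypothetical.

# Lumped view: roughly the setup the uninfluencability definition
# works with, where the teacher's mind is just part of one opaque
# environment that the prior ρ ranges over.
lumped_model = {
    "agent_action": ["environment"],
    "environment": ["reward_feedback"],
    "reward_feedback": [],
}

# Separated view, in the spirit of the reward tampering and
# counterfactual planning diagrams: the teacher's mind state is its
# own node, so a property like 'no incentive on the
# agent_action -> teacher_mind edge' can be stated directly.
separated_model = {
    "agent_action": ["world_state", "teacher_mind"],
    "world_state": ["teacher_mind", "reward_feedback"],
    "teacher_mind": ["reward_feedback"],
    "reward_feedback": [],
}

def parents(model, node):
    """Nodes with an edge into `node`."""
    return [n for n, children in model.items() if node in children]

# In the separated model, the manipulation path is visible as an
# explicit edge into the teacher's mind:
print(parents(separated_model, "teacher_mind"))
# ['agent_action', 'world_state']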
Secondary remarks
Here are some further secondary remarks.
With the above remarks, I do not mean to imply that the
uninfluencability safety property as defined lacks any value: I may
still want to have this as a desirable safety property in an agent.
But if it were present, it would trigger a new concern: if the
environment is such that the reward function is clearly influencable,
then any learning system prior which rules this out may be making
some pretty strange assumptions about the environment.
These might produce unsafe consequences, or just vast inefficiencies,
in the behavior of the agent.
This theme could be explored more, but the paper does not do so, and I
have also not done so. (I spent some time trying to come up with
clarifying toy examples, but no example I constructed really clarified
things for me.)
A more general concern: the approach in the paper suffers somewhat from
a methodological problem that I have seen more often in the AI and AGI
safety literature. At this point in time, there is a tendency to
frame every possible AI-related problem as a machine learning problem,
and to frame any solution as being the design of an improved machine
learning system. To me, this framing obfuscates the solution space.
To make this more specific: the paper sets out to define useful
constraints on ρ, a prior over the agent environment, but does
not consider the step of first exploring constraints on μ, the
actual agent environment itself. To me, the more natural approach
would be to first look for useful constraints on μ, and only then
to consider the option of projecting these into ρ as a backup
option, when μ happens to lack the constraints.
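
A toy rendering of this distinction, again entirely my own
construction and reusing the hypothetical smiling example from above:
the constraint can hold for the prior ρ while the analogous
constraint fails for the actual environment μ.

# Hypothetical true environment μ: per-action probability that the
# reward-relevant signal (the smile) occurs.
mu = {"ask_politely": 0.3, "inject_smile_drug": 0.95}

def rho_posterior(history):
    # The prior ρ bakes in the assumption that actions cannot
    # influence which reward function is correct: the posterior over
    # reward functions is constant.
    return {"Rs": 1.0}

def constraint_on_rho(posterior, histories):
    # The paper's route: a property of the learner's prior/posterior
    # only.
    results = [posterior(h) for h in histories]
    return all(r == results[0] for r in results)

def constraint_on_mu(environment):
    # The route suggested above: the analogous property of the actual
    # environment; is the reward-relevant signal independent of the
    # agent's action?
    probs = list(environment.values())
    return all(p == probs[0] for p in probs)

histories = [("ask_politely",), ("inject_smile_drug",)]
print(constraint_on_rho(rho_posterior, histories))  # True: ρ satisfies it
print(constraint_on_mu(mu))                         # False: μ does not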
In my mind, the problem of an agent manipulating its teacher or
supervisor to maximize its reward is not a problem of machine
learning, but more fundamentally a problem of machine reasoning,
or even more fundamentally a problem which is present in any
game-theoretical setup where rewards are defined by a level of
indirection. I discuss these methodological points at greater length
in my paper on counterfactual planning.
If I use this level-of-indirection framing to back-project the design
in the paper, my first guess would be that ‘uninfluencability’ might
possibly say something about the agent having no incentive to hack its
own compute core in order to change the reward function encoded
within. But I am not sure if that first guess would pan out.