Meta: This comment has my thoughts about the paper Pitfalls of
Learning a Reward Function Online.
I figure I should post them here so that others looking for comments
on the paper might find them.
I read the paper back in 2020; it has been on my backlog ever since to
think more about it and share my comments. Apologies for the delay,
etc.
Mathematical innovation
First off, I agree with the general observations in the introduction
that there are pitfalls to learning a reward function online, with a
human in the loop.
The paper looks at options for removing some of these pitfalls, or at
least making them less dangerous. The research agenda pursued by the
paper is one I like a lot, an agenda of mathematical innovation. The
paper mathematically defines certain provable safety properties
(uninfluencability and unriggability), and also explores how useful
these might be.
Similar agendas of mathematical innovation can be found in the work
of Everitt et al., for example in Agent Incentives: A Causal
Perspective, and in my work, for
example in AGI Agent Safety by Iteratively Improving the Utility
Function. These also use causal
influence diagrams in some way, and try to develop them in a way that
is useful for defining and analyzing AGI safety. My personal
intuition is that we need more of this type of work; this agenda is
important to advancing the field.
The math in the paper
That being said: the bad news is that I believe that the mathematical
route explored by Pitfalls of Learning a Reward Function
Online is most likely a dead end.
Understanding why is of course the interesting bit.
The main issue I will explore is: we have a mathematical property that
we label with the natural language word ‘uninfluencability’. But does
this property actually produce the beneficial ‘uninfluencability’
effects we are after? Section 4 in the paper also explores this
issue and shows some problems; my main goal here is to add further
insights.
My feeling is that ‘uninfluencability’, the mathematical property as
defined, does not produce the effects I am after. To illustrate this,
my best example is as follows. Take a reward function Rs that
measures the amount of smiling by the human teaching the agent,
observed over the entire history hn. Take a reward function
learning process which assumes (in its prior ρ) that the probability of
the choice for this reward function at the end of the history,
P(Rs|hn,ρ), cannot be influenced by the actions taken by the
agent during the history, so for example ρ is such that
∀hn: P(Rs|hn,ρ)=1. This reward function learning
process is unriggable. But the agent using this reward function
learning process also has a major incentive to manipulate the human
teacher into smiling, by injecting them with smile-inducing drugs, or
whatever.
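
To make this concrete, here is a minimal toy sketch in Python. It is
entirely my own construction, not code or definitions from the paper,
and all action names and smile probabilities are hypothetical. The
point it illustrates: the posterior over reward functions is constant
in the history, so no agent action can change it, yet the
Rs-maximizing policy is still the manipulative one.

# Toy sketch of the smiling example above; all names and numbers are
# made up for illustration.

ACTIONS = ["ask_politely", "inject_smile_drug"]

# Hypothetical per-step probability that the human teacher smiles,
# per agent action.
P_SMILE = {"ask_politely": 0.3, "inject_smile_drug": 0.95}

def posterior_over_reward_functions(history):
    """P(.|hn,ρ): constant in the history, so no action can change it."""
    return {"Rs": 1.0}  # Rs counts the smiles observed over the history

def expected_smile_reward(action, horizon=3):
    """Expected Rs-reward from repeating one action for `horizon` steps."""
    return horizon * P_SMILE[action]

# The learning process gives the same answer whatever the agent did:
histories = [("ask_politely",), ("inject_smile_drug",)]
assert all(posterior_over_reward_functions(h) == {"Rs": 1.0} for h in histories)

for a in ACTIONS:
    print(a, expected_smile_reward(a))
# ask_politely 0.9
# inject_smile_drug 2.85   <- manipulation still wins under Rs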
So it seems to me that the choice taken in the paper to achieve the
following design goal:
“Ideally, we do not want the reward function to be a causal
descendant of the policy.”
is not taking us on a route that goes anywhere very promising, given
the problem statement. The safety predicate of uninfluencability
still allows for conditions that insert the mind of the human teacher
directly into the path to value of a very powerful optimizer. To make
the mathematical property of ‘uninfluencability’ do what it says on
the tin, it seems to me that further constraints need to be added.
Some speculation: to go this route of adding constraints, I think we
need a model that separates the mind state of the teacher, or at least
some causal dependents of this mind state, more explicitly from the
remainder of the agent environment. There are several such
increased-separation causal models in Reward Tampering Problems and
Solutions in Reinforcement Learning: A Causal Influence Diagram
Perspective and in Counterfactual
planning. This
then brings us back on the path of using the math of indifference, or
lack of causal incentives, to define safety properties.
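
As a very rough illustration of this increased-separation idea, here
is a sketch of two causal models written as plain Python child-lists.
This is entirely my own construction, not a diagram taken from either
of the two references, and all node names are hypothetical.

# Lumped view: roughly the setup the uninfluencability definition
# works with, where the teacher's mind is just part of one opaque
# environment that the prior ρ ranges over.
lumped_model = {
    "agent_action": ["environment"],
    "environment": ["reward_feedback"],
    "reward_feedback": [],
}

# Separated view, in the spirit of the reward tampering and
# counterfactual planning diagrams: the teacher's mind state is its
# own node, so a property like 'no incentive on the
# agent_action -> teacher_mind edge' can be stated directly.
separated_model = {
    "agent_action": ["world_state", "teacher_mind"],
    "world_state": ["teacher_mind", "reward_feedback"],
    "teacher_mind": ["reward_feedback"],
    "reward_feedback": [],
}

def parents(model, node):
    """Nodes with an edge into `node`."""
    return [n for n, children in model.items() if node in children]

# In the separated model, the manipulation path is visible as an
# explicit edge into the teacher's mind:
print(parents(separated_model, "teacher_mind"))
# ['agent_action', 'world_state']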
Secondary remarks
Here are some further secondary remarks.
With the above remarks, I do not mean to imply that the
uninfluencability safety property as defined lacks any value: I may
still want to have this as a desirable safety property in an agent.
But if it were present, it would trigger a new concern: if the
environment is such that the reward function is clearly influencable,
then any learning system prior which rules this out may be making
some pretty strange assumptions about the environment.
These might produce unsafe consequences, or just vast inefficiencies,
in the behavior of the agent.
This theme could be explored more, but the paper does not do so, and I
have also not done so. (I spent some time trying to come up with
clarifying toy examples, but no example I constructed really clarified
things for me.)
A more general concern: the approach in the paper suffers somewhat from
a methodological problem that I have seen more often in the AI and AGI
safety literature. At this point in time, there is a tendency to
frame every possible AI-related problem as a machine learning problem,
and to frame any solution as being the design of an improved machine
learning system. To me, this framing obfuscates the solution space.
To make this more specific: the paper sets out to define useful
constraints on ρ, a prior over the agent environment, but does
not consider the step of first exploring constraints on μ, the
actual agent environment itself. To me, the more natural approach
would be to first look for useful constraints on μ, and only then
to consider the option of projecting these into ρ as a backup
option, when μ happens to lack the constraints.
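
A toy rendering of this distinction, again entirely my own
construction and reusing the hypothetical smiling example from above:
the constraint can hold for the prior ρ while the analogous
constraint fails for the actual environment μ.

# Hypothetical true environment μ: per-action probability that the
# reward-relevant signal (the smile) occurs.
mu = {"ask_politely": 0.3, "inject_smile_drug": 0.95}

def rho_posterior(history):
    # The prior ρ bakes in the assumption that actions cannot
    # influence which reward function is correct: the posterior over
    # reward functions is constant.
    return {"Rs": 1.0}

def constraint_on_rho(posterior, histories):
    # The paper's route: a property of the learner's prior/posterior
    # only.
    results = [posterior(h) for h in histories]
    return all(r == results[0] for r in results)

def constraint_on_mu(environment):
    # The route suggested above: the analogous property of the actual
    # environment; is the reward-relevant signal independent of the
    # agent's action?
    probs = list(environment.values())
    return all(p == probs[0] for p in probs)

histories = [("ask_politely",), ("inject_smile_drug",)]
print(constraint_on_rho(rho_posterior, histories))  # True: ρ satisfies it
print(constraint_on_mu(mu))                         # False: μ does not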
In my mind, the problem of an agent manipulating its teacher or
supervisor to maximize its reward is not a problem of machine
learning, but more fundamentally a problem of machine reasoning,
or even more fundamentally a problem which is present in any
game-theoretical setup where rewards are defined by a level of
indirection. I discuss these methodological points at greater length
in my paper on counterfactual planning.
If I use this level-of-indirection framing to back-project the design
in the paper, my first guess would be that ‘uninfluencability’ might
possibly say something about the agent having no incentive to hack its
own compute core in order to change the reward function encoded
within. But I am not sure if that first guess would pan out.