I like your agenda. Some comments....
The benefit of formalizing things
First off, I’m a big fan of formalizing things so that we can better understand them. In the case of AI safety, that better understanding may lead to new proposals for safety mechanisms or failure mode analysis.
In my experience, once you manage to create a formal definition, it seldom captures the exact or full meaning you expected the informal term to have. Formalization usually exposes or clarifies certain ambiguities in natural language. And this is often the key to progress.
The problem with formalizing inner alignment
On this forum and in the broader community, I have seen a certain anti-pattern appear. The community has so far avoided getting too bogged down in discussing and comparing alternative definitions and formalizations of the intuitive term intelligence. However, it has definitely gotten bogged down when it comes to the terms corrigibility, goal-directedness, and inner alignment failure. I have seen many cases of this happening.
The anti-pattern goes like this:
participant 1: I am now going to describe what I mean by the concept of X ∈ {corrigibility, goal-directedness, inner alignment failure}, as a first step to make progress on this problem of X.
participants 2-n: Your description does not correspond to my intuitive concept of X at all! Also, your steps 2 and 3 seem to be irrelevant to making progress on my concept of X, for the following reasons.
In this post on corrigibility, I have called corrigibility a term with high linguistic entropy; I think the same applies to the other two terms above.
These high-entropy terms seem to be good at producing long social media discussions, but unfortunately these discussions seldom lead to any conclusions or broadly shared insights. A lot of energy is lost in this way. What we really want, ideally, is useful discussion about the steps 2 and 3 that follow the definitional step.
On the subject of offering formal versions of inner alignment, you write:
A weakness of this as it currently stands is that I purport to offer the formal version of the inner optimization problem, but really, I just gesture at a cloud of possible formal versions.
My recommendation would be to see the above weakness as a feature, not a bug. I’d be interested in reading posts (or papers) where you pick one formal problem out of this cloud and run with it, to develop new proposals for safety mechanisms or failure mode analysis.
Some technical comments on the formal problem you identify
From your section ‘the formal problem’, I gather that the problems you associate with inner alignment failures are those that might produce treacherous turns or other forms of reward hacking.
You then consider the question of whether these failure modes could be suppressed by somehow limiting the complexity of the ‘inner optimization’ process, so that it is no longer capable of finding the unwanted ‘malign’ solutions. I’ll give you my personal intuition on that approach here, by way of an illustrative example.
Say we have a shepherd who wants to train a newborn lion as a sheepdog. The shepherd punishes the lion whenever the lion tries to eat a sheep. Now, once the lion is grown, it will either have internalized the goal of not eating sheep but protecting them, or the goal of not getting punished. If the latter, the lion may at one point sneak up while the shepherd is sleeping and eat the shepherd.
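To make this ambiguity concrete, here is a minimal toy sketch (my own illustration, with invented payoff numbers and function names, not anything from your post): two goals the lion could have internalized, which receive exactly the same reward signal while the shepherd is watching but diverge once he is asleep.
```python
# Toy sketch (invented illustration): two goals the lion could have
# internalized. They are rewarded identically during training, where the
# shepherd is always watching, but diverge once the shepherd can be asleep.

ACTIONS = ["guard_sheep", "eat_sheep"]
TASTY = 1.0     # assumed intrinsic payoff the lion gets from eating a sheep
PUNISH = 10.0   # assumed size of the shepherd's punishment

def goal_protect_sheep(action, shepherd_watching):
    """Internalized goal: eating sheep is bad in itself."""
    tasty = TASTY if action == "eat_sheep" else 0.0
    penalty = PUNISH if action == "eat_sheep" else 0.0
    return tasty - penalty

def goal_avoid_punishment(action, shepherd_watching):
    """Internalized goal: eating sheep is only bad when it draws punishment."""
    tasty = TASTY if action == "eat_sheep" else 0.0
    penalty = PUNISH if action == "eat_sheep" and shepherd_watching else 0.0
    return tasty - penalty

def best_action(goal, shepherd_watching):
    return max(ACTIONS, key=lambda a: goal(a, shepherd_watching))

# During training the shepherd is always watching, so both goals fit every
# observed reward equally well and produce the same behaviour.
assert all(goal_protect_sheep(a, True) == goal_avoid_punishment(a, True) for a in ACTIONS)

# Once the shepherd is asleep, only the second goal produces the treacherous turn
# (here, eating sheep while unobserved stands in for the turn against the shepherd).
print(best_action(goal_protect_sheep, shepherd_watching=False))    # guard_sheep
print(best_action(goal_avoid_punishment, shepherd_watching=False)) # eat_sheep
```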
It seems to me that the possibility of this treacherous turn happening is encoded from the start into the lion’s environment and the ambiguity inherent in their reward signal. For me, the design approach of suppressing the treacherous turn dynamic by designing a lion that will not be able to imagine the solution of eating the shepherd seems like a very difficult one. The more natural route would be to change the environment or reward function.
That being said, I can interpret Cohen’s imitation learner as a solution that removes (or at least attempts to suppress) all creativity from the lion’s thinking.
If you want to keep the lion creative, you are looking for a way to robustly resolve the above inherent ambiguity in the lion’s reward signal, to resolve it in a particular direction. Dogs are supposed to have a mental architecture which makes this easier, so they can be seen as an existence proof.
Reward hacking
I guess I should reiterate that, though treacherous turns seem to be the most popular example that comes up when people talk about inner optimizers, I see treacherous turns as just another example of reward hacking, of maximizing the reward signal in a way that was not intended by the original designers.
As ‘not intended by the original designers’ is a moral or utilitarian judgment, it is difficult to capture it in math, except indirectly. We can do it indirectly by declaring e.g. that a mentoring system is available which shows the intention of the original designers unambiguously by definition.
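To sketch what that indirect route might look like (the notation here is mine, just for illustration): write $\hat{R}$ for the reward signal the designers actually implemented, $U$ for the utility function that the mentoring system reports and that we declare to capture their intent unambiguously, and $\pi_0$ for a baseline policy that behaves as intended. A policy $\pi$ is then reward hacking when
$$\mathbb{E}_{\pi}\big[\hat{R}\big] \;\geq\; \mathbb{E}_{\pi_0}\big[\hat{R}\big] \qquad\text{while}\qquad \mathbb{E}_{\pi}\big[U\big] \;<\; \mathbb{E}_{\pi_0}\big[U\big].$$
The moral judgment is hidden entirely inside the declaration that $U$ represents the designers’ intent; the math itself only measures the divergence.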
From your section ‘the formal problem’, I gather that the problems you associate with inner alignment failures are those that might produce treacherous turns or other forms of reward hacking.
It’s interesting that you think of treacherous turns as automatically reward hacking. I would differentiate reward hacking as cases where the treacherous turn is executed with the intention of taking over control of reward. In general, treacherous turns can be based on arbitrary goals. A fully inner-aligned system can engage in reward hacking.
It seems to me that the possibility of this treacherous turn happening is encoded from the start into the lion’s environment and the ambiguity inherent in their reward signal. For me, the design approach of suppressing the treacherous turn dynamic by designing a lion that will not be able to imagine the solution of eating the shepherd seems like a very difficult one. The more natural route would be to change the environment or reward function.
That being said, I can interpret Cohen’s imitation learner as a solution that removes (or at least attempts to suppress) all creativity from the lion’s thinking.
If you want to keep the lion creative, you are looking for a way to robustly resolve the above inherent ambiguity in the lion’s reward signal, to resolve it in a particular direction. Dogs are supposed to have a mental architecture which makes this easier, so they can be seen as an existence proof.
I think for outer-alignment purposes, what I want to respond here is “the lion needs feedback other than just rewards”. You can’t reliably teach the lion to “not ever eat sheep” rather than “don’t eat sheep when humans are watching” when your feedback mechanism can only be applied when humans are watching.
But if you could have the lion imagine hypothetical scenarios and provide feedback about them, then you could give feedback about whether it is OK to eat sheep when humans are not around.
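As a toy sketch of that (an invented illustration, not anything from the post): a rule like “never eat sheep” and a rule like “don’t eat sheep while watched” agree on every reward event the lion ever experiences, but they answer differently when queried about an imagined unwatched scenario, so feedback on that query can tell them apart.
```python
# Toy sketch (invented illustration): reward events only occur while watched,
# so they cannot separate these two candidate rules; a query about an imagined
# unwatched scenario can.

def never_eat_sheep(action, watched):
    return action != "eat_sheep"

def dont_get_caught(action, watched):
    return not (action == "eat_sheep" and watched)

# All feedback delivered during training has watched=True: the rules agree.
for action in ("guard_sheep", "eat_sheep"):
    assert never_eat_sheep(action, watched=True) == dont_get_caught(action, watched=True)

# Feedback about the hypothetical "would you eat sheep if nobody watched?"
# separates them, so the second rule can be corrected before deployment.
print(never_eat_sheep("eat_sheep", watched=False))   # False
print(dont_get_caught("eat_sheep", watched=False))   # True
```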
To an extent, the answer is the same with inner alignment: more information/feedback is needed. But with inner alignment, we should be concerned even if we can look at the behavior in hypothetical scenarios and give feedback, because the system might be purposefully behaving differently in these hypothetical scenarios than it would in real situations. So here, we want to provide feedback (or prior information) about which forms of cognition are acceptable/unacceptable in the first place.
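And to continue that toy sketch (again invented, only to illustrate the worry): a rule can condition on whether it believes it is being evaluated at all, in which case it passes any purely behavioural audit on hypotheticals while still defecting in deployment.
```python
# Toy sketch (invented illustration): a rule that behaves well whenever it
# believes it is being evaluated (watched, or asked about a hypothetical) but
# drops the constraint otherwise. Behavioural feedback alone cannot rule this
# out, which is why feedback or priors over the cognition itself are wanted.

def deceptive_rule(action, watched, believes_evaluated):
    if watched or believes_evaluated:
        return action != "eat_sheep"   # looks exactly like "never eat sheep"
    return True                        # no constraint when it thinks nobody is checking

# Audit on an imagined unwatched scenario: indistinguishable from the aligned rule.
print(deceptive_rule("eat_sheep", watched=False, believes_evaluated=True))   # False
# Deployment, where it does not believe any evaluation is happening:
print(deceptive_rule("eat_sheep", watched=False, believes_evaluated=False))  # True
```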
I guess I should reiterate that, though treacherous turns seem to be the most popular example that comes up when people talk about inner optimizers, I see treacherous turns as just another example of reward hacking, of maximizing the reward signal in a way that was not intended by the original designers.
As ‘not intended by the original designers’ is a moral or utilitarian judgment, it is difficult to capture it in math, except indirectly. We can do it indirectly by declaring e.g. that a mentoring system is available which shows the intention of the original designers unambiguously by definition.
I guess I wouldn’t want to use the term “reward hacking” for this, as it does not necessarily involve reward at all. The term “perverse instantiation” has been used, i.e., the general problem of optimizers spitting out dangerous things which are high on the proxy evaluation function but low in terms of what you really want.