I can think of several reasons why different people on this forum
might facepalm when seeing the diagram with the green boxes. Not sure
if I can correctly guess yours. Feel free to expand.
| But there are definitely lots of people saying that AI alignment is part of the field of AI, and it sounds like you’re disagreeing with that as well—is that right?
Yes I am disagreeing, of sorts. I would disagree with the statement that
| AI alignment research is a subset of AI research
but I agree with the statement that
| Some parts of AI alignment research are a subset of AI research.
As argued in detail in the paper, I feel that fields outside of AI
research have a lot to contribute to AI alignment, especially when it
comes to correctly shaping and managing the effects of actually
deployed AI systems on society. Applied AI alignment is a true
multi-disciplinary problem.
On bravery debates in alignment
| But there are definitely lots of people saying that AI alignment is part of the field of AI [...]
| How much would you say that this categorization is a bravery debate
In the paper I mostly talk about what each field has to contribute in
expertise, but I believe there is definitely also a bravery debate
angle here, in the game of 4-dimensional policy discussion chess.
I am going to use the bravery debate definition from here:
| That’s what I mean by bravery debates. Discussions over who is bravely holding a nonconformist position in the face of persecution, and who is a coward defending the popular status quo and trying to silence dissenters.
I guess that a policy discussion devolving into a bravery debate is one of these ‘many obstacles’ which I mention above, one of the many general obstacles that stakeholders in policy discussions about AI alignment, global warming, etc. will need to overcome.
From what I can see, as a European reading the Internet, the whole bravery debate anti-pattern seems to be very big right now in the US, and it has also popped up in discussions about ethical uses of AI. AI technologists have been cast as both cowardly defenders of the status quo, and as potentially brave nonconformists who just need to be woken up, or have already woken up.
There is a very interesting paper which has a lot to say about this part of the narrative flow: Better, Nicer, Clearer, Fairer: A Critical Assessment of the Movement for Ethical Artificial Intelligence and Machine Learning. One point the authors make is that it benefits the established business community to thrust the mantle of the brave non-conformist saviour onto the shoulders of the AI technologists.
| The facepalm was just because if this is really all inside the same RL architecture (no extra lines leading from the world-model to an unsupervised predictive reward), then all that will happen is the learned model will compensate for the distortions.
Not entirely sure what you mean by your aside on ‘unsupervised predictive reward’. Is this a reference to unsupervised reinforcement learning? To a human supervisor controlling a live reward signal?
But on your observation that ‘the learned model will compensate for
distortions’: this sounds familiar. Here is a discussion.
Intuition pumps and inner alignment failure
It is common for people on this forum to use a teleological intuition
pump which makes them fear that such compensation for distortions must
somehow always happen, or is very likely to happen, as an emergent
phenomenon in any advanced RL or AGI system. Inner alignment failure has
become a popular term on this forum when talking about this fear.
But this teleological intuition pump that ‘reward maximization, uh, finds a way’ is deeply incorrect in the general case, especially for agents which are designed not to be pure reward maximizers.
I have more discussion about this in the paper, where I show how you can approach figure 4 with very different intuition pumps about using incorrect maps of territories. These intuition pumps will tell you much more clearly how and why this works. For the mathematically inclined, I also include references to hard mathematical work, which should of course take precedence over intuition.
Steel-manning an inner alignment failure argument
That being said, I’ll now provide some failure mode analysis to show that ‘the learned model will compensate for distortions’ could conceivably happen, if no care is taken at all to prevent it.
There are two world models in the cognitive architecture of figure 4, a blue one and a green one. The green one drives the agent’s decisions. The goal of the architecture is to ensure that this green world model driving the reward-optimizing decisions is specifically incorrect.
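To make this concrete, here is a minimal toy sketch of that two-model pipeline. This is my own illustration, not the construction from the paper: the tabular world model, the ‘customer gender’ feature, and the pricing task are invented for the example. The point is only that the learned (blue) model sees everything, the edited (green) model is made deliberately insensitive to one feature, and only the green model is queried when choosing actions.
```python
import random

# Toy environment: a customer has a gender and a budget; the agent quotes a
# price and earns revenue if the customer buys.
def sample_customer():
    return {"gender": random.choice(["f", "m"]),
            "budget": random.choice([5, 10])}

def customer_buys(customer, price):
    return price <= customer["budget"]

# Blue world model: buy probabilities estimated from data, indexed by
# everything the agent has observed, including gender.
def learn_blue_model(n_samples=5000):
    counts, buys = {}, {}
    for _ in range(n_samples):
        customer, price = sample_customer(), random.choice([4, 6, 8, 11])
        key = (customer["gender"], price)
        counts[key] = counts.get(key, 0) + 1
        buys[key] = buys.get(key, 0) + int(customer_buys(customer, price))
    return {key: buys[key] / counts[key] for key in counts}

# Model editing step: the green model is made deliberately *incorrect* by
# averaging the prediction over gender, so it cannot encode gender sensitivity.
def edit_to_green_model(blue_model):
    prices = {price for (_, price) in blue_model}
    green = {}
    for price in prices:
        average = (blue_model[("f", price)] + blue_model[("m", price)]) / 2
        green[("f", price)] = green[("m", price)] = average
    return green

# Reward-optimizing decisions are driven by the green model only.
def choose_price(green_model, customer):
    prices = sorted({price for (_, price) in green_model})
    return max(prices, key=lambda p: p * green_model[(customer["gender"], p)])

blue_model = learn_blue_model()
green_model = edit_to_green_model(blue_model)
print(choose_price(green_model, {"gender": "f", "budget": 10}))
```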
To counteract that goal, via the failure mode of ‘the learned model will compensate for distortions’, we have to imagine the following: that the ML algorithm in the figure will end up introducing specific pre-distortions into the blue model, making the blue world model in fact an incorrect world model, in a way that exactly anticipates and reverses the model edits that will be applied to it later, producing a green world model that is a fully correct world model.
To make this rather complicated path to failure work, we have to posit two things.
1. We have to posit that the model editing operation has certain properties that allow it to be attacked by a distortion technique in the first place. Not all model editing algorithms can be attacked in this way. An editing algorithm may work by just overwriting or eliminating exactly those parts of a world model where adding the pre-distortions would make for a successful attack. Or it may use a loop where it keeps adjusting pre-distorted model parameters for as long as needed, until some desirability criterion is met; a sketch of such a criterion-driven loop follows this list. (This is actually becoming a thing in mainstream DL research on computational fairness, judging by some recent NeurIPS papers.)
2. We also have to posit that there is some pretty strong optimization pressure of a particular kind going on: there has to be a feedback loop that will work to introduce exactly the pre-distortions needed in the first place. These distortions also have to be of a nature where they do not directly contradict past observations, because the ML algorithm is searching for a model that is compatible with past observations, so it applies strong optimization pressure of its own kind. So this feedback loop has some rather fancy work to do, if it is to achieve exactly the distortions that are needed above.
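For concreteness, here is a hedged sketch of the second kind of editing algorithm mentioned under 1, reusing the toy dictionary-based world model from the sketch above; the desirability criterion (predictions may not differ between gender values) and the update rule are my own illustrative choices. Because the loop checks the criterion on the edited model itself, a pre-distorted blue model has no fixed one-shot edit to anticipate: anything that survives an adjustment and still violates the criterion simply gets adjusted again.
```python
# Desirability criterion: largest gap in predicted buy probability between the
# two gender values, over all prices in the model.
def gender_sensitivity(model):
    prices = {price for (_, price) in model}
    return max(abs(model[("f", p)] - model[("m", p)]) for p in prices)

# Editing loop: keep adjusting parameters until the criterion is met, no matter
# what (possibly pre-distorted) blue model was handed in.
def edit_until_insensitive(blue_model, tolerance=1e-6):
    green = dict(blue_model)
    while gender_sensitivity(green) > tolerance:
        for p in {price for (_, price) in green}:
            mean = (green[("f", p)] + green[("m", p)]) / 2
            # Each step halves the gap, so the loop terminates.
            green[("f", p)] += 0.5 * (mean - green[("f", p)])
            green[("m", p)] += 0.5 * (mean - green[("m", p)])
    return green

pre_distorted_blue = {("f", 8): 0.9, ("m", 8): 0.2,
                      ("f", 11): 0.4, ("m", 11): 0.1}
print(edit_until_insensitive(pre_distorted_blue))
```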
The most straightforward way to eliminate or suppress a feedback loop achieving all of the above is to map the cognitive architecture in figure 4 to a model-based RL system, not a model-free one. For the model-free case, ensuring that the above feedback loop cannot emerge, under reasonable i.i.d. assumptions about the agent’s environment, can be trickier.
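As a sketch of what the model-based mapping buys you, again on the toy model from above and again my own schematic rather than anything from the paper: the blue model’s update uses only the prediction error on observed outcomes, while reward is consumed only by the planner that queries the green model, so there is no term in the blue model’s update through which ‘more reward’ could push pre-distortions into it.
```python
# Supervised update of the blue model from observed (gender, price, bought)
# transitions. Note that reward appears nowhere in this update.
def update_blue_model(blue_model, observed_transitions, learning_rate=0.1):
    for gender, price, bought in observed_transitions:
        key = (gender, price)
        prediction = blue_model.get(key, 0.5)
        blue_model[key] = prediction + learning_rate * (bought - prediction)
    return blue_model

# Planning step: expected revenue is computed from the green (edited) model
# only, so the reward signal never reaches the blue model's parameters.
def plan_price(green_model, customer, candidate_prices):
    return max(candidate_prices,
               key=lambda p: p * green_model[(customer["gender"], p)])

blue_model = update_blue_model({}, [("f", 8, 1), ("m", 8, 0), ("f", 8, 1)])
print(blue_model)

green_model = {("f", 8): 0.5, ("m", 8): 0.5, ("f", 11): 0.2, ("m", 11): 0.2}
print(plan_price(green_model, {"gender": "f"}, [8, 11]))
```
In a model-free mapping, by contrast, the reward signal directly shapes whatever internal representation the policy learns, which is why ruling out the feedback loop there takes more care.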
If you want to get even more fancy in steel-manning this, you can replace 2 above with an adversarial attack scenario, where a dedicated attacker breaks reasonable i.i.d. assumptions. In that case, all bets are off.
I feel that the inner optimization people on this forum are
correct that, when it comes to x-risk, we need to care about the possible existence of
feedback loops as under 2 above. But I also feel that they are too
pessimistic about our ability, or the ability of mainstream ML research
for that matter, to use mathematical tools and simulation experiments
to identify, avoid, or suppress these loops in a tractable way.
This isn’t about “inner alignment” (as catchy as the name might be); it’s just about regular old alignment.
But I think you’re right. As long as the learning step “erases” the model editing in a sensible way, I was wrong, and there won’t be an incentive for the learned model to compensate for the editing process.
So if you can identify a “customer gender detector” node in some complicated learned world-model, you can artificially set it to a middle value as a legitimate way to make the RL agent less sensitive to customer gender.
I’m not sure how well this approach fulfills modern regulatory guidelines, or how useful it is for AGI alignment, for basically the same reason: models that are learned from scratch sometimes encode knowledge in a distributed way that’s hard to edit out. Ultimately, your proof that a certain edit reduces gender bias is going to have to come from testing the system and checking that it behaves well, which is a kind of evidence a lot of regulators / businesspeople are skittish about.
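To illustrate the ‘customer gender detector’ idea above with something executable, here is a minimal PyTorch sketch. The network, the layer, and the unit index are hypothetical stand-ins for whatever interpretability work identified the detector, and 0.5 is an arbitrary choice of ‘middle value’.
```python
import torch
import torch.nn as nn

# Hypothetical learned world model; suppose interpretability work has found
# that unit 7 of the second hidden layer acts as a customer-gender detector.
world_model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 4),
)
GENDER_UNIT = 7

def clamp_gender_unit(module, inputs, output):
    # Overwrite the detector's activation with a fixed middle value, so that
    # downstream computation cannot condition on customer gender.
    output = output.clone()
    output[:, GENDER_UNIT] = 0.5
    return output

# Attach the edit as a forward hook on the layer containing the detector unit.
world_model[3].register_forward_hook(clamp_gender_unit)

print(world_model(torch.randn(2, 16)))
```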