Roughly speaking, you can imagine two ways to get safety:
1. Design the output channels so that unsafe actions / plans do not exist.
2. Design the AI system so that even though unsafe actions / plans do exist, the AI system doesn’t take them.
I would rephrase your argument as “there are some types of STEM AI that are safe because of (1), and given some reasonable loss function those AI systems should be said to be outer aligned at optimum”. This is also the argument that applies to image classifiers.
----
In the case where point 1 is literally true, I just wouldn’t even talk about whether the system is “aligned”; if it doesn’t have the possibility of an unsafe action, then whether it is “aligned” feels meaningless to me. (You can of course still say that it is “safe”.)
Note that in any such situation, there is no inner alignment worry. Even if the model is completely deceptive and wants to kill as many people as possible, by hypothesis we said that unsafe actions / plans do not exist, and the model can’t ever succeed at killing people.
----
A counterargument could be “okay, sure, some unsafe action / plan exists by which the AI takes over the world, but that happens only via side channels, not via the expected output channel”.
I note that in this case, if you include all the channels available to the AI system, then the system is not outer aligned at optimum, because the optimal thing to do is to take over the world and then always feed itself inputs whose outputs are perfectly known, leading to zero loss.
Presumably what you’d want instead is to say something like “given a model in which the only output channel available to the AI system is ___, the optimal policy that only gets to act through that channel is aligned”. But this is basically saying that in the abstract model you’ve chosen, (1) applies; and again I feel like saying that this system is “aligned” is somehow missing the point of what “aligned” is supposed to mean.
As a concrete example, let’s take your image classifier example.
1. If we change the loss function so that dogs are labeled as cats and vice versa, is it still outer aligned at optimum (assuming the original was)?
2. What if it labeled humans as gorillas?
If you said yes to both (i.e. both are still outer aligned at optimum), then hopefully you can see why the concept feels meaningless to me in this situation.
If you said no to both (i.e. these examples are no longer outer aligned at optimum), then I claim that the original loss function is also not outer aligned at optimum, because we could improve the categories used in the loss function (and you seem to agree that a loss function with worse categories is not outer aligned at optimum).
If you said yes to the first and no to the second, or yes to the second and no to the first, I have no idea what you mean by “outer aligned at optimum”.
----
Separately, even when you limit to a specific action space like classifying images, I could imagine that a literally optimal policy would still be able to take over the world given that action space (think of a policy that can predict and use the butterfly effect of classifying images), so I still don’t feel like it’s outer aligned at optimum. (Although perhaps this still doesn’t perform as well as the policy that magically knows all the answers and so can perfectly predict (what we label as) the class of any image.)
But this is not my real objection; my real objection is what I discussed above (that the concept “alignment” should not be tracking whether there does or does not exist an unsafe action in the AI’s action space).
Oops, I actually wasn’t trying to discuss whether the action-space was wide enough to take over the world. Turns out concrete examples can be ambiguous too. I was trying to highlight whether the loss function and training method incentivised taking over the world or not.
Instead of an image classifier, let’s take GPT-3, which has a wide enough action-space to take over the world. Let’s assume that:
1. GPT-3 is currently being tested on a validation set which has some correct answers. (I’m fine with “optimal performance” either requiring that GPT-3 magically returns these correct answers, or requiring that it returns some distribution along the lines I defined in my post.)
2. The researchers will read what GPT-3 outputs, and there exists some string that causes them to go mad and give control over the world to GPT-3.
In this case, if we define optimal performance as “correctly predicting as many words as possible” or “achieve minimum total loss over the entire history of the world”, I agree that optimal performance would plausibly involve taking over the world to feed itself the right questions. However, I think a better definition is “Correctly predicts every word it’s asked to predict”, because I think this tracks what (un)supervised learning actually selects for (absent inner alignment concerns).
To make that last point clearer: I’m claiming that in a supervised learning scenario where predicting label X always gives the neural network a small punishment, but then causes the training procedure to give the model 10 easy questions, the model will learn to not choose label X. (At least in the long run. In the short run, I could see the model being a little bit more likely to predict label X than expected, because easy questions could give the model a less steep gradient, so a neural network that predicts X would stick around a bit longer than it would have if X hadn’t yielded easy questions. However, every time X is chosen, the network is directionally pushed away from choosing it again; so in the infinite-data limit, I think it would learn not to do that.) Conversely, if predicting label Y is correct on some inputs, but predicting Y immediately led to 10 hard questions, I don’t think the model would learn to avoid label Y (though I’m unsure if the learning process would converge to choosing Y or just be unstable and never converge). I’m actually very curious if you agree with this; it seems like an important question.
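Here’s a minimal toy version of the dynamic I’m claiming (a contrived setup of my own, just to illustrate; the exact loss numbers are made up):

```python
import numpy as np

# Contrived toy of the setup above (my construction, not a claim about real
# training runs). A single logit `theta` controls P(predict X) on base
# questions whose correct label is never X, so probability mass on X is always
# mildly punished by the cross-entropy loss. Whenever the sampled prediction is
# X, the next 10 questions are "easy" (loss 0.01 each); otherwise they are
# "hard" (loss 2.0 each). Those follow-up questions don't involve theta, so
# they add to the total loss but contribute no gradient.

rng = np.random.default_rng(0)
theta, lr = 2.0, 0.1          # start out strongly favouring X
total_loss = 0.0

for step in range(2000):
    p_x = 1.0 / (1.0 + np.exp(-theta))      # P(predict X)
    total_loss += -np.log(1.0 - p_x)        # base question: correct label is not-X
    predict_x = rng.random() < p_x          # sampled prediction decides the follow-ups
    total_loss += 10 * (0.01 if predict_x else 2.0)
    theta -= lr * p_x                       # grad of -log(1 - sigmoid(theta)) is sigmoid(theta)

print(f"P(predict X) after training: {1.0 / (1.0 + np.exp(-theta)):.4f}")  # ~0
```

The follow-up questions move the total loss around a lot, but they never touch the gradient on theta, so in the long run the model just unlearns X; that’s what I mean by (un)supervised learning not selecting for controlling future questions.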
(Conversely, I think that a robot RL-agent would be incentivised to control its future challenges, because of how RL loss functions are shaped. A definition of RL optimal behavior should naturally take this into account.)
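To spell out the contrast in symbols (the notation is mine): the “minimum total loss over the entire history” definition looks like

$$\pi^*_{\text{total}} \in \arg\min_{\pi}\ \mathbb{E}_{\tau \sim P(\cdot\mid\pi)}\Big[\sum_{(x_t,\,y_t)\in\tau} \ell\big(\pi(x_t),\,y_t\big)\Big]$$

whereas the “correctly predicts every word it’s asked to predict” definition looks like

$$\pi^*_{\text{per}}(x) \in \arg\min_{a}\ \mathbb{E}\big[\ell(a,\,y)\mid x\big] \quad \text{for every question } x \text{ that is actually asked.}$$

In the first, the trajectory distribution $P(\tau\mid\pi)$ depends on the policy, so steering which questions come up is part of the objective (this is also the shape of an RL return, which is why I think the RL case is different); in the second, the policy is only scored conditionally on each question it’s actually given, and that’s the notion I’m using below.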
Now, by defining optimal behavior as “Correctly predicts every word it’s asked to predict”, I’m saying that the model is outer aligned at optimum iff the model cannot simultaneously answer every question optimally and adversarially optimise against humans. And if we only consider the normal output channel, I’m saying it couldn’t, because it would have to make incorrect predictions in order to broadcast its message; in which case it wouldn’t be optimal.
If we also consider side-channels, this gets messier, because my chosen definition doesn’t imply any constraints on what the AI could do with side channels. Thus, if it could use side channels to take over the world in any way, my definition would say it wasn’t outer aligned. My preferred solution to this is just to intentionally ignore side channels when defining outer alignment at optimum (as you correctly inferred). Three reasons for this:
As mentioned above, taking side channels into account would mean that any model with powerful side channels is classified as outer misaligned, even if there’s no incentive to use these side channels in any particular way.
Separately, I suspect that supervised learning normally doesn’t incentivise neural networks to use side channels in any particular way (absent inner alignment concerns).
Finally, it just seems kind of useful to talk about the outer alignment properties of abstract agent-models, since not all abstract agent-models are outer aligned. Side-constraints can be handled separately.
(Btw I’d say the bad image classifier is aligned but less performance-competitive than a good image classifier, though I’m sympathetic to the view that it doesn’t make sense to talk about its alignment properties at all.)
Oops, I actually wasn’t trying to discuss whether the action-space was wide enough to take over the world.
Ah, in hindsight your comment makes more sense.
I’m actually very curious if you agree with this; it seems like an important question.
Argh, I don’t know, you’re positing a setup that breaks the standard ML assumptions and so things get weird. If you have vanilla SGD, I think I agree, but I wouldn’t be surprised if that’s totally wrong.
There are definitely setups where I don’t agree, e.g. if you have an outer hyperparameter tuning loop around the SGD, then I think you can get the opposite of the behavior you’re claiming (I think this paper shows this in more detail, though it’s been edited significantly since I read it). That would still depend on how often you do the hyperparameter tuning, what hyperparameters you’re allowed to tune, etc.
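As a contrived toy of the mechanism I have in mind (my own construction, not the paper’s setup): keep your inner loop exactly as you describe it, where SGD always pushes P(predict X) down, but add an outer loop that treats the initial logit and the learning rate as hyperparameters and keeps whichever run ends up with the lowest total loss over the episode. Because the easy follow-up questions dominate that total, the outer loop selects the configuration that keeps predicting X:

```python
import numpy as np

# Toy outer loop (my construction). Inner training is the same as in your
# sketch: on base questions X is always wrong, predicting X serves 10 easy
# follow-ups (loss 0.01 each) instead of 10 hard ones (loss 2.0 each), and SGD
# on the logit always pushes P(predict X) down. The outer loop treats the
# initial logit and the learning rate as hyperparameters and keeps whichever
# run has the lowest *total* loss over the episode.

rng = np.random.default_rng(0)

def run(theta_init, lr, steps=500):
    theta, total_loss = theta_init, 0.0
    for _ in range(steps):
        p_x = 1.0 / (1.0 + np.exp(-theta))
        total_loss += -np.log(1.0 - p_x)                          # base question
        total_loss += 10 * (0.01 if rng.random() < p_x else 2.0)  # follow-ups
        theta -= lr * p_x                                         # SGD pushes P(X) down
    return total_loss

candidates = [(t0, lr) for t0 in (-2.0, 0.0, 2.0) for lr in (1e-4, 1e-2, 1.0)]
best = min(candidates, key=lambda c: run(*c))
print("hyperparameters selected by the outer loop (theta_init, lr):", best)
# -> the X-leaning initialization with a tiny learning rate wins: selection on
#    total loss favors keeping the X-predicting behavior that SGD unlearns.
```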
----
On the rest of the comment: I feel like the argument you’re making is “when the loss function is myopic, the optimal policy ignores long-term consequences and is therefore safe”. I do feel better about calling this “aligned at optimum”, if the loss function also incentivizes the AI system to do what we designed it for. It still feels like the lack of convergent instrumental subgoals is “just because of” the myopia, and that this strategy won’t work more generally.
----
Returning to the original claim:
Specifically, I think there exists some setups and some parsimonious definition of “optimal performance” [for STEM AI] such that optimal performance is aligned: and I claim that’s the more useful definition.
I do agree that these setups probably exist, perhaps using the myopia trick in conjunction with the simulated world trick. (I don’t think myopia by itself is enough; to have STEM AI enable a pivotal act you presumably need to give the AI system a non-trivial amount of “thinking time”.) I think you will still have a pretty rough time trying to define “optimal performance” in a way that doesn’t depend on a lot of details of the setup, but at least conceptually I see what you mean.
I’m not as convinced that these sorts of setups are really feasible—they seem to sacrifice a lot of benefits—but I’m pretty unconfident here.