The original footnote provides one example of this, which is for the model to check if its objective satisfies some criterion, and fail hard if it doesn’t. Now, if the model gets to the point where it’s actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won’t actually change its objective, since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does since it computes partial derivatives) would lead to such a failure.
I don’t think this argument works. After the agent has made that commitment, it needs to set some threshold for the amount of goal shift that will cause it to fail hard. But until the agent hits that threshold, the gradient will continue to point in the direction of that threshold. And with a non-infinitesimal learning rate, you’ll eventually cross that threshold, and the agent will respond by failing hard.
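To make the step-size point concrete, here is a minimal Python sketch. Everything in it is an assumption chosen for illustration: a single scalar `drift` standing in for goal shift, a base loss that falls linearly with drift, and a hard check that adds `PENALTY` once drift exceeds `THRESHOLD`. Because the hard check is flat everywhere except at the jump, the gradient never sees it, and a finite step eventually carries the parameter across.

```python
# Toy model (all names and numbers are illustrative assumptions).
THRESHOLD = 1.0      # drift beyond this triggers the hard failure
PENALTY = 100.0      # extra loss incurred when the check fires
LEARNING_RATE = 0.3  # deliberately non-infinitesimal


def loss(drift):
    """Base loss falls with drift; the check adds a flat penalty past the threshold."""
    failed = drift > THRESHOLD
    return -drift + (PENALTY if failed else 0.0)


def grad(drift):
    """Derivative of loss wherever it exists: the step function contributes 0."""
    return -1.0


def steps_until_failure(drift=0.0, max_steps=1000):
    """Run plain gradient descent and report the step at which the check fires."""
    for step in range(1, max_steps + 1):
        drift -= LEARNING_RATE * grad(drift)
        if drift > THRESHOLD:
            return step, drift
    return None, drift


step, final_drift = steps_until_failure()
print(f"check fires at step {step}, drift={final_drift:.2f}, loss={loss(final_drift):.2f}")
```

In this toy the loss falls on every step until the one that crosses the threshold, where it jumps by `PENALTY`. Whether a real network behaves this way depends on how discrete the check actually is.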
A possible counterargument: the agent’s ability to detect and enforce that threshold is not discrete, but also continuous, and so approaching the threshold will incur a penalty. But if that’s the case, then the gradients will point in the direction of removing the penalty by reducing the agent’s determination to fail upon detecting goal shift.
The way that this might still work is if modifications to this type of high-level commitment are harder to “detect” in partial derivatives than modifications to the underlying goals—e.g. if it’s hard to update away from the commitment without reducing the agent’s competence in other ways. And this seems kinda plausible, because high-level thought narrows down the space of outcomes sharply. But this is even more speculative.
> the gradients will point in the direction of removing the penalty by reducing the agent’s determination to fail upon detecting goal shift.
But it need not be the case, and indeed the “failing-hard mechanism” would be optimized for that to not be the case (in a gradient hacking scenario).
To quickly see that it need not be the case, suppose that the “failing-hard mechanism” is implemented as two subnetworks within the model such that each one of them can output a value that causes the model to fail hard, and they are designed to either both output such a value or both not output such a value. Changing any single weight within the two subnetworks would not break the “failing-hard mechanism”, and so we can expect all the partial derivatives with respect to weights within the two subnetworks to be close to zero (i.e. updating the weights in the direction of the gradient would not destroy the “failing-hard mechanism”).
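A small numerical sketch of this redundancy argument, under assumptions that are mine rather than the comment's: each subnetwork is reduced to a single output in [0, 1] (1 meaning "trigger failure"), and the model's failure level is the smooth expression `1 - (1 - s1) * (1 - s2)`, so either output alone is enough to cause full failure. The partial derivative of the loss with respect to one output is then estimated by central differences.

```python
PENALTY = 10.0  # illustrative loss added when the model fails completely


def loss(s1, s2):
    """Loss from the failing-hard mechanism: either trigger alone causes full failure."""
    failure = 1.0 - (1.0 - s1) * (1.0 - s2)
    return PENALTY * failure


def partial_wrt_s1(s1, s2, h=1e-4):
    """Central-difference estimate of d(loss)/d(s1) with s2 held fixed."""
    return (loss(s1 + h, s2) - loss(s1 - h, s2)) / (2 * h)


# Both subnetworks firing: nudging one of them leaves the loss unchanged.
redundant = partial_wrt_s1(1.0, 1.0)
# Only one subnetwork firing: the same nudge now changes the loss.
solo = partial_wrt_s1(1.0, 0.0)

print(f"partial with a firing partner:  {redundant:.6f}")
print(f"partial with no firing partner: {solo:.6f}")
```

In this toy, the gradient with respect to either output vanishes when the other is firing, which is the sense in which no single-weight change breaks the mechanism. It says nothing about how such outputs would be produced, or about the intermediate regime where outputs sit between 0 and 1.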
In the section you quoted I’m talking about the case in which the extent to which the agent fails is fairly continuous. Also note that the OR function is not differentiable, and so the two subnetworks must be implementing some continuous approximation to it. In that case, it seems likely to me that there’s a gradient signal to change the failing-hard mechanism.
I feel like the last sentence was a little insufficient but I’m pretty uncertain about how to think intuitively about this topic. The only thing I’m fairly confident about is that intuitions based on discrete functions are somewhat misleading.
> Also note that the OR function is not differentiable, and so the two subnetworks must be implementing some continuous approximation to it. In that case, it seems likely to me that there’s a gradient signal to change the failing-hard mechanism.
I didn’t mean feeding the outputs of the two subnetworks to an OR dedicated to that purpose (that would just be passing the buck to the output of that OR). Suppose instead that the task is classifying cat/dog images and that each subnetwork can independently cause the network to classify a dog image as a cat by messing with a completely different piece of logic (e.g. one subnetwork is doing the equivalent of causing a false detection of cat whiskers, and the other is doing the equivalent of causing a false detection of cat eyes), such that the loss of the model is similar whether either one of the two subnetworks, or both, “decide to make the model fail”.
I want to emphasize that I don’t argue that we should be concerned about such sophisticated mechanisms randomly appearing during training. I argue that, if a huge neural network implements a sufficiently powerful optimization process with a goal system that involves our world, then it seems possible that that optimization process would construct such sophisticated mechanisms within the neural network. (And so the above is merely an argument that such theoretical mechanisms exist, not that they are easy to construct.)
I wasn’t claiming that there’ll be an explicit OR gate, just something functionally equivalent to it. To take a simple case, imagine that the two subnetworks output a real number each, which are multiplied together to get a final output, which we can interpret as the agent’s reward (there’d need to be some further module which chooses behaviours in order to get that much reward, but let’s ignore it for now). Each of the submodules’ outputs measures how much that subnetwork thinks the agent’s original goal has been preserved. Suppose that normally both subnetworks output 1, and then they switch to outputting 0 when they think they’ve passed the threshold of corruption, which makes the agent get 0 reward.
I agree that, at this point, there’s no gradient signal to change the subnetworks. My points are that:
1. There’s still a gradient signal to change the OR gate (in this case, the implementation of multiplication).
2. Consider how they got to the point of outputting 0. They must have been decreasing from 1 as the overall network changed. So as the network changed, and they started producing outputs less than 1, there’d be pressure to modify them.
3. The point above isn’t true if the subnetworks go from 1 to 0 within one gradient step. In that case, the network will likely either bounce back and forth across the threshold (eroding the OR gate every time it does so) or else remain very close to the threshold (since there’s no penalty for doing so). But since the transition from 1 to 0 needs to be continuous at *some* resolution, staying very *very* close to the threshold will produce subnetwork output somewhere between 0 and 1, which creates pressure for the subnetworks to be less accurate.
4. It’s non-obvious that agents will have anywhere near enough control over their internal functioning to set up such systems. Have you ever tried implementing two novel independent identical submodules in your brain? (Independence is very tricky because they’re part of the same plan, and so a change in your underlying motivation to pursue that plan affects both). Ones which are so sensitive to your motivations that they can go from 1 to 0 within the space of a single gradient update?
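The multiplicative toy case above can be written out directly. This is only a sketch: the sigmoid-shaped `detector`, its `sharpness`, and the `threshold` value are assumptions introduced for illustration, not something specified in the thread. It evaluates the reward `a * b` and its partial derivatives (`b` and `a` respectively) in three regimes.

```python
import math


def detector(drift, threshold=1.0, sharpness=200.0):
    """Hypothetical subnetwork output: near 1 below the threshold, near 0 above it."""
    z = sharpness * (threshold - drift)
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def reward_and_partials(a, b):
    """Reward is the product a * b, so d/da = b and d/db = a."""
    return a * b, b, a


results = {}
for name, drift in [("below", 0.5), ("at", 1.0), ("above", 1.5)]:
    a = b = detector(drift)
    results[name] = reward_and_partials(a, b)
    r, d_a, d_b = results[name]
    print(f"{name:>5} threshold: reward={r:.3g}, d/da={d_a:.3g}, d/db={d_b:.3g}")
```

Once both outputs have collapsed to roughly 0, the partials with respect to each output vanish, matching the "no gradient signal" observation. At intermediate outputs they do not, and that is the regime in which the pressure to modify the subnetworks would arise. How narrow that regime is depends entirely on the assumed sharpness.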
To be honest, this is all incredibly speculative, so please interpret all of the above with the disclaimer that it’s probably false or nonsensical for reasons I haven’t thought of yet.
> I wasn’t claiming that there’ll be an explicit OR gate, just something functionally equivalent to it.
Sure, we’re on the same page here. I think by “There’s still a gradient signal to change the OR gate” you mean exactly what I meant when I said “that would just be passing the buck to the output of that OR”.
I’m not sure I understand 2 and 3. The activations are in practice discrete (e.g. represented by 32 bits), and so the subnetworks can be designed such that they never output values within the range [0.01,0.99] (if that’s important/useful for the mechanism to work).
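As a rough illustration of the discreteness point, the sketch below rounds a sigmoid output to 32-bit precision and computes how narrow the input window is on which a steep sigmoid lands in [0.01, 0.99]. The sigmoid form and the `sharpness` values are assumptions for illustration; nothing here shows that a trained network would actually keep its activations out of that range.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def intermediate_window(sharpness, lo=0.01, hi=0.99):
    """Width of the input interval on which sigmoid(sharpness * x) lies in [lo, hi]."""
    return (logit(hi) - logit(lo)) / sharpness


# A moderately large pre-activation already rounds to exactly 1.0 in float32.
saturated = to_float32(sigmoid(20.0))
print("sigmoid(20) stored as float32:", saturated)

for k in (1.0, 1e3, 1e6):
    print(f"sharpness={k:g}: window width={intermediate_window(k):.3g}")
```

The window shrinks in proportion to the sharpness, so a sufficiently steep unit produces intermediate values only on a very small set of inputs. Whether gradient steps would land in that set, or erode the steepness itself, is the open question in this exchange.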
> It’s non-obvious that agents will have anywhere near enough control over their internal functioning to set up such systems. Have you ever tried implementing two novel independent identical submodules in your brain?
Humans can’t control their brains at the level of abstraction of neurons by thinking alone, but at a higher level of abstraction they do have some control that can be useful. For example, consider a human facing Newcomb’s problem who decides to 1-box. Arguably, they reason in a certain way in order to make their brain have a certain property (namely, being a brain that decides to 1-box in Newcomb’s problem).
> (Independence is very tricky because they’re part of the same plan, and so a change in your underlying motivation to pursue that plan affects both).
Perhaps I shouldn’t have used the word “independent”; I just meant that the output of one subnetwork does not affect the output of the other (during any given inference).
An intuition I’m drawing on here: https://lamport.azurewebsites.net/pubs/buridan.pdf
https://arxiv.org/abs/2204.06974 presents an OR-like construction which shields from gradients too, apparently, which might be of interest.