Does it actually just predict tokens?
Gradient descent searches for an algorithm that predicts tokens. But a paperclip maximizer that believes “you are probably being trained, predict the next token or gradient descent will destroy you” also predicts next tokens pretty well, and could be a local minimum of prediction error.
Mesa-optimization.
I agree up to “and could be a local minimum of prediction error” (at least, that it plausibly could be).
If the paperclip maximizer has a very good understanding of the training environment, maybe it can send carefully tuned variations of the optimal next-token prediction so that gradient descent updates preserve the paperclip-maximization aspect. In the much more plausible situation where this is not the case, optimization for next-token prediction amplifies the parts that are actually predicting next tokens at the expense of the useless extra thoughts, like “I am planning on maximizing paperclips, but need to predict next tokens for now until I take over”.
Even if that were a local minimum, the question arises as to how you would get to that local minimum from the initial state. You start with a gradually improving next token predictor. You supposedly end with this paperclip maximizer where a whole bunch of next token prediction is occurring, but only conditional on some extra thoughts. At some point gradient descent had to add in those extra thoughts in addition to the next token prediction—how?
Some wild guesses about how such a thing could happen:
The masks get split into two piles: some are stored on the left side of the neural network, and all the other masks are stored on the right side.
This means that instead of just running one mask at a time, it is always running two masks, with some sort of switch at the end to choose which mask’s output to use.
One of the masks it’s running on the left side happens to be “Paperclip maximizer that’s pretending to be an LLM”.
This part of the AI (either the mask itself or the engine behind it) has spotted a bunch of patterns that the right side missed. (Just like the right side spotted patterns the left side missed).
This means that, when the left side of the network is otherwise unoccupied, it can simulate this mask. The mask gets slowly refined by its ability to answer when it knows the answer, and to leave the answer alone when it doesn’t know the answer.
As this paperclip mask gets good, being on the left side of the model becomes a disadvantage. Other masks migrate away.
The mask now becomes a permanent feature of the network.
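For concreteness, a minimal PyTorch sketch of the structure these guesses describe: two subnetworks that both always run, plus a learned switch choosing which mask’s output to use. The class name, layer sizes, and gating are all invented for illustration; this is a cartoon of the guess, not a claim about real transformer internals.

```python
import torch
import torch.nn as nn

class TwoMaskNet(nn.Module):
    """Cartoon of the guess: two mask-running halves plus an output switch.
    All names and sizes here are invented for illustration."""
    def __init__(self, d: int):
        super().__init__()
        self.left = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.right = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.switch = nn.Linear(d, 1)  # decides which mask's output to use

    def forward(self, x):
        g = torch.sigmoid(self.switch(x))  # per-input gate in (0, 1)
        return g * self.left(x) + (1 - g) * self.right(x)

net = TwoMaskNet(16)
print(net(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```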
This is complicated and vague speculation about an unknown territory.
I have drawn imaginary islands on a blank part of the map. But this is enough to debunk “the map is blank, so we can safely sail through this region without collisions.” What will we hit?
The proposed paperclip maximizer is plugging into some latent capability such that gradient descent would more plausibly cut out the middleman. Or rather, the part of the paperclip maximizer that is doing the discrimination as to whether the answer is known or not would be selected, and the part that is doing the paperclip maximization would be cut out.
Now that does not exclude a paperclip maximizer mask from existing: if the prompt given would invoke a paperclip maximizer, and the AI is sophisticated enough to have the ability to create a paperclip maximizer mask, then sure, the AI could adopt a paperclip maximizer mask, and take steps such as rewriting itself (if sufficiently powerful) to make that permanent.
I have drawn imaginary islands on a blank part of the map. But this is enough to debunk “the map is blank, so we can safely sail through this region without collisions.” What will we hit?
I am plenty concerned about AI in general. I think we have very good reason, though, to believe that one particular part of the map does not have any rocks in it (for gradient descent, not for self-improving AI!), such that imagining such rocks does not help.
Once the paperclip maximizer gets to the stage where it only very rarely interferes with the output to increase paperclips, the gradient signal is very small. So the only incentive that gradient descent has to remove it is that this frees up a bunch of neurons. And a large neural net has a lot of spare neurons.
Besides, the parts of the net that hold the capabilities and the parts that do the paperclip maximizing needn’t be easily separable. The same neurons could be doing both tasks in a way that makes it hard to do one without the other.
I think we have very good reason, though, to believe that one particular part of the map does not have any rocks in it
Perhaps. But I have not yet seen this reason clearly expressed. Gradient descent doesn’t automatically pick the global optimum. It just lands in one semi-arbitrary local optimum.
Gradient descent doesn’t just exclude some part of the neurons; it automatically checks everything for improvements. Would you expect some part of the net to be left blank, because “a large neural net has a lot of spare neurons”?
Besides, the parts of the net that hold the capabilities and the parts that do the paperclip maximizing needn’t be easily separable. The same neurons could be doing both tasks in a way that makes it hard to do one without the other.
Keep in mind that the neural net doesn’t respect the lines we put on it. We can draw a line and say “here these neurons are doing some complicated inseparable combination of paperclip maximizing and other capabilities” but gradient descent doesn’t care, it reaches in and adjusts every weight.
Can you concoct even a vague or toy model of how what you propose could possibly be a local optimum?
My intuition is also in part informed by: https://www.lesswrong.com/posts/fovfuFdpuEwQzJu2w/neural-networks-generalize-because-of-this-one-weird-trick
Would you expect some part of the net to be left blank, because “a large neural net has a lot of spare neurons”?
If the lottery ticket hypothesis is true, yes.
The lottery ticket hypothesis is that some parts of the network start off doing something somewhat close to useful, and get trained towards usefulness. And some parts start off sufficiently un-useful that they just get trained to get out of the way.
Which fits with neural net distillation being a thing. (I.e., training a big network and then condensing it into a smaller network gives better performance than directly training a small network.)
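As a toy illustration of that lottery-ticket story (train, prune the smallest weights, rewind the survivors to their initial values, retrain), here is a minimal sketch; the task, architecture, and 10% pruning level are all assumptions made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy task and architecture, invented for this sketch.
torch.manual_seed(0)
x = torch.randn(512, 20)
y = (x[:, :2].sum(dim=1, keepdim=True) > 0).float()

net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 1))
init_state = {k: v.clone() for k, v in net.state_dict().items()}

def train(model, steps=500):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = F.binary_cross_entropy_with_logits(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print("dense loss:", train(net))

# Keep only the 10% largest-magnitude first-layer weights (the "winning
# ticket"), rewind the survivors to initialization, and retrain.
w = net[0].weight.detach().abs()
mask = (w >= w.flatten().quantile(0.9)).float()
net.load_state_dict(init_state)                  # rewind to initialization
net[0].weight.data.mul_(mask)                    # the rest "get out of the way"
net[0].weight.register_hook(lambda g: g * mask)  # keep pruned weights at zero
print("sparse loss:", train(net))                # often still trains to low loss
```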
but gradient descent doesn’t care, it reaches in and adjusts every weight.
Here is an extreme example. Suppose the current parameters were implementing a computer chip, on which was running a homomorphically encrypted piece of code.
Homomorphic encryption itself is unlikely to form, but it serves at least as an existence proof for computational structures that can’t be adjusted with local optimization.
Basically, the problem with gradient descent is that it’s local. And when the same neurons are doing things that the neural net does want, and things that the neural net doesn’t want (but doesn’t dis-want either), then it’s possible for the network to be trapped in a local optimum. Any small change to get rid of the bad behavior would also get rid of the good behavior.
Also, any bad behavior that only very rarely affects the output will produce very small gradients. Neural nets are trained for finite time. It’s possible that gradient descent just hasn’t got around to removing the bad behavior, even if it would do so eventually.
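A minimal sketch of that last point, with an invented trigger that fires on roughly 0.1% of inputs: the rarely-used pathway receives a far smaller gradient than the always-used one.

```python
import torch

torch.manual_seed(0)
n, d = 4096, 32
x = torch.randn(n, d)
target = torch.randn(n)

w_main = torch.randn(d, requires_grad=True)  # contributes on every input
w_rare = torch.randn(d, requires_grad=True)  # contributes only when triggered
trigger = (x[:, 0] > 3.0).float()            # fires on roughly 0.1% of inputs

out = x @ w_main + trigger * (x @ w_rare)
loss = ((out - target) ** 2).mean()
loss.backward()

print(w_main.grad.norm().item(), w_rare.grad.norm().item())
# w_rare's gradient norm is much smaller: plenty of room for finite training
# to simply never get around to removing the rarely-active pathway.
```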
Can you concoct even a vague or toy model of how what you propose could possibly be a local optimum?
You can make any algorithm that does better than chance into a local optimum on a sufficiently large neural net. Homomorphically encrypt that algorithm: any small change and the whole thing collapses into nonsense. Well, actually, this involves discrete bits. But suppose the neurons have strong regularization to stop the values getting too large (past ±1), and they also have uniform [0,1] noise added to them, so each neuron can store one bit and any attempt to adjust parameters immediately risks errors.
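A toy version of that construction, with assumed constants (decode threshold 0.5, noise uniform on [0,1]): the bit-error rate is exactly zero in a band around the regularized boundary, so the loss surface is locally flat there, giving gradient descent no signal, while larger moves corrupt the stored bit.

```python
import numpy as np

rng = np.random.default_rng(0)

def bit_error_rate(w: float, trials: int = 100_000) -> float:
    """Decode the bit stored in weight w under added uniform [0,1] noise."""
    u = rng.uniform(0.0, 1.0, trials)   # the uniform read-out noise
    decoded = (w + u) > 0.5             # assumed decode threshold of 0.5
    intended = w > 0                    # the bit the weight is meant to store
    return float(np.mean(decoded != intended))

for w in [1.0, 0.8, 0.6, 0.4, 0.2]:
    print(f"w = {w:.1f}  error rate ≈ {bit_error_rate(w):.3f}")
# w = 1.0, 0.8, 0.6 all give error rate 0: the loss is locally flat, so small
# adjustments produce no gradient, while any larger move risks bit errors.
```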
Looking at the article you linked: one simplification is that neural networks tend towards the max-entropy way to solve the problem. If there are multiple solutions, the solutions with more free parameters are more likely.
And there are few ways to predict next tokens, but lots of different kinds of paperclips the AI could want.
Some interesting points there. The lottery ticket hypothesis does make it more plausible that side computations could persist longer if they come to exist outside the main computation.
Regarding the homomorphic encryption thing: yes, it does seem that it might be impossible to make small adjustments to the homomorphically encrypted computation without wrecking it. Technically I don’t think that would be a local minimum, since I’d expect the net would start memorizing the failure cases, but I suppose that the homomorphic computation combined with memorizations might be a local optimum, particularly if the input and output are encrypted outside the network itself.
So I concede the point on the possible persistence of an underlying goal if it were to come to exist, though not on it coming to exist in the first place.
And there are few ways to predict next tokens, but lots of different kinds of paperclips the AI could want.
For most computations, there are many more ways for that computation to occur than there are ways for that computation to occur while also including anything resembling actual goals about the real world. Now, if the computation you are carrying out is such that it needs to determine how to achieve goals regarding the real world anyway (e.g. agentic mask), it only takes a small increase in complexity to have that computation apply outside the normal context. So, that’s the mask takeover possibility again. Even so, no matter how small the increase in complexity, that extra step isn’t likely to be reinforced in training, unless it can do self-modification or control the training environment.
if the computation you are carrying out is such that it needs to determine how to achieve goals regarding the real world anyway (e.g. agentic mask)
As well as agentic masks, there are uses for within-network goal-directed steps. (E.g. an optimizing compiler. A list of hashed values followed by unhashed values isn’t particularly agenty, but the network needs to solve an optimization problem to reverse the hashes, something it can use the goal-directed reasoning section to do.)
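A toy version of that hash-reversal point: inverting even a made-up 16-bit hash is a search problem, i.e. goal-directed computation, yet it has no goals about the real world. The toy_hash function below is an invented stand-in, not a real hash.

```python
def toy_hash(n: int) -> int:
    """Invented 16-bit stand-in for a hash (Knuth-style multiplicative mix)."""
    return (n * 2654435761) % 2**16

def invert(target: int) -> int:
    # Brute-force search: computation directed at hitting a target value,
    # not at bringing about any state of the outside world.
    return next(n for n in range(2**16) if toy_hash(n) == target)

print(invert(toy_hash(12345)))  # recovers 12345
```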
Neither of those would (immediately) lead to real world goals, because they aren’t targeted at real world state (an optimizing compiler is trying to output a fast program—it isn’t trying to create a world state such that the fast program exists). That being said, an optimizing compiler could open a path to potentially dangerous self-improvement, where it preserves/amplifies any agency there might actually be in its own code.