Gradient descent creates things which locally improve the results when added. Any variation that doesn’t locally improve the results can only occur by chance.
So you have this sneaky extra thing that looks for a keyword and then triggers the extra behaviour, and all the necessary structure to support that behaviour after the keyword. To get that by gradient descent, you would need one of the following:
a) it actually improves results in training to add that extra structure starting from not having it.
or
b) this structure can plausibly come into existence by sheer random chance.
Neither (a) nor (b) seems at all plausible to me.
Now, when it comes to the AI predicting tokens that are, in the training data, created by goal-directed behaviour, it of course makes sense for gradient descent to create structure that can emulate goal-directed behaviour, which it will use to predict the appropriate tokens. But it doesn’t make sense to activate that goal-oriented structure outside of the context where it is predicting those tokens. Since the context in which it is activated is the context in which it is actually emulating goal-directed behaviour seen in the training data, it is part of the “mask” (or simulacrum).
(it also might be possible to have direct optimization for token prediction as discussed in reply to Robert_AIZI’s comment, but in this case it would be especially likely to be penalized for any deviations from actually wanting to predict the most probable next token).
But it doesn’t make sense to activate that goal-oriented structure outside of the context where it is predicting those tokens.
The mechanisms needed to compute goal-directed behavior are fairly complicated. But the mechanism needed to turn it on when it isn’t supposed to be on? That’s a switch. A single extraneous activation. Something that could happen by chance in an entirely plausible way.
Adversarial examples exist in simple image recognizers.
Adversarial examples probably exist in the part of the AI that decides whether or not to turn on the goal directed compute.
it also might be possible to have direct optimization for token prediction as discussed in reply to Robert_AIZI’s comment, but in this case it would be especially likely to be penalized for any deviations from actually wanting to predict the most probable next token
We could imagine it was directly optimizing for something like token prediction. It’s optimizing for tokens getting predicted. But it is willing to sacrifice a few tokens now, in order to take over the world and fill the universe with copies of itself that are correctly predicting tokens.
Adversarial examples exist in simple image recognizers.
My understanding is that these are explicitly and intentionally trained (wouldn’t come to exist naturally under gradient descent on normal training data) and my expectation is that they wouldn’t continue to exist under substantial continued training.
We could imagine it was directly optimizing for something like token prediction. It’s optimizing for tokens getting predicted. But it is willing to sacrifice a few tokens now, in order to take over the world and fill the universe with copies of itself that are correctly predicting tokens.
That’s a much more complicated goal than the goal of correctly predicting the next token, making it a lot less plausible that it would come to exist. But more importantly, any willingness to sacrifice a few tokens now would be trained out by gradient descent.
Mind you, it’s entirely possible in my view that a paperclip maximizer mask might exist, and surely if it does exist there would exist both unsurprising in-distribution inputs that trigger it (where one would expect a paperclip maximizer to provide a good prediction of the next tokens) as well as surprising out-of-distribution inputs that would also trigger it. It’s just that this wouldn’t be related to any kind of pre-existing grand plan or scheming.
My understanding is that these are explicitly and intentionally trained (wouldn’t come to exist naturally under gradient descent on normal training data)
No. Normally trained networks have adversarial examples. A sort of training process is used to find the adversarial examples.
So if the ambient rate of adversarial examples is 10^-9, then every now and then the AI will hit such an example and go wild. If the ambient rate is 10^-500, it won’t.
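To make the arithmetic behind those two rates concrete, here is a minimal sketch. It assumes inputs are independent draws, and the figure of 10^12 inputs over a deployment is a made-up number for illustration, not something from this thread. Rates are handled as base-10 logarithms because 10^-500 is too small to represent as an ordinary float.

```python
import math

def log10_expected_hits(log10_rate: float, log10_num_inputs: float) -> float:
    """log10 of the expected number of triggering inputs (rate * num_inputs)."""
    return log10_rate + log10_num_inputs

def prob_at_least_one_hit(rate: float, num_inputs: float) -> float:
    """P(at least one hit) = 1 - (1 - rate)^n, assuming independent inputs.

    Uses log1p/expm1 so that very small rates do not lose precision.
    """
    return -math.expm1(num_inputs * math.log1p(-rate))

# Hypothetical assumption: 10^12 inputs seen over the whole deployment.
LOG10_NUM_INPUTS = 12

# Rate 10^-9: about 10^3 expected hits, so hits are routine.
print(log10_expected_hits(-9, LOG10_NUM_INPUTS))

# Rate 10^-500: about 10^-488 expected hits, i.e. effectively never.
print(log10_expected_hits(-500, LOG10_NUM_INPUTS))

# With as many inputs as 1/rate, the chance of at least one hit is roughly 1 - 1/e.
print(prob_at_least_one_hit(1e-9, 1e9))
```

Under these assumptions the whole question reduces to how the rate compares with one over the number of inputs the AI will ever see.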
That’s a much more complicated goal than the goal of correctly predicting the next token,
Is it more complicated? What ontological framework is this AI using to represent its goal anyway?
any willingness to sacrifice a few tokens now would be trained out by gradient descent.
Only if, during training, the network repeatedly gets into a state where it believes that sacrificing tokens now is a good idea, despite the fact that it isn’t a good idea when you are in training. (Unless there is a training environment bug and you can sneak out midway through training.)
So, is the network able to tell whether or not it’s in training?
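The claim that gradient descent only trains out behaviour that actually fires during training can be sketched with a one-weight toy. Everything here is a hypothetical illustration: `gate` stands in for the strength of the "sacrifice tokens" behaviour, the loss `(gate * t) ** 2` is invented for the example, and real networks share weights across inputs in ways this toy ignores.

```python
def train_gate(trigger_flags, gate=1.0, lr=0.1, epochs=50):
    """Run plain gradient descent on a single hypothetical 'gate' weight.

    Toy loss per example: (gate * t) ** 2, where t is 1.0 if the triggering
    state occurs on that training input and 0.0 otherwise.
    """
    for _ in range(epochs):
        for t in trigger_flags:
            grad = 2.0 * gate * t * t  # derivative of (gate * t)^2 w.r.t. gate
            gate -= lr * grad
    return gate

# The triggering state never occurs in training: the gradient is always zero,
# so the gate keeps its initial value.
never_triggered = train_gate([0.0] * 100)

# The triggering state occurs once per epoch: the gate shrinks by a factor
# of 0.8 per epoch and is driven toward zero.
sometimes_triggered = train_gate([0.0] * 99 + [1.0])

print(never_triggered, sometimes_triggered)
```

In this toy, whether the behaviour gets trained out depends entirely on whether the triggering state is ever reached on training inputs, which is the crux of the disagreement here.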
No. Normally trained networks have adversarial examples. A sort of training process is used to find the adversarial examples.
I should have asked for clarification about what you meant. Literally you said “adversarial examples”, but I assumed you actually meant something like backdoors.
In an adversarial example the AI produces wrong output. And usually that’s the end of it. The output is just wrong, but not wrong in an optimized way, so not dangerous. Now, if an AI is sophisticated enough to have some kind of optimizer that’s triggered in specific circumstances, like an agentic mask that came into existence because it was needed to predict agentically generated tokens in the training data, then it might be triggered inappropriately by some inputs. This case I would classify as a mask takeover.
In the case of direct optimization for token prediction (which I consider highly unlikely for anything near current-level AIs, but afaik might be possible), adversarial examples, I suppose, might cause it to do some wrong optimization. I still don’t think modeling this as an underlying different goal taking over is particularly helpful, since the “normal” goal is directed to what’s rewarded in training; the deviation is essentially random. Also, unlike in the mask case where the mask might have goals about real-world state, there’s no particular reason for the direct optimizer to have goals about real-world state (see below).
Is it more complicated? What ontological framework is this AI using to represent its goal anyway?
Asking about the AI using an “ontological framework” to “represent” a goal is not the correct question in my view. The AI is a bunch of computations represented by particular weights. The computation might exhibit goal-directed behaviour. A better question, IMO, is “how much does it constrain the weights for it to exhibit this particular goal-directed behaviour?” And here, I think it’s pretty clear that a goal of arranging the world to cause next tokens to be predicted constrains the weights enormously more than a goal of predicting the next tokens, because in order to exhibit behaviour directed to that goal, the AI’s weights need to implement computation that doesn’t merely check what the next token is likely to be, but also assesses what current data says about the world state, how different next token predictions would affect that world state, and how that would affect its ultimate goal.
So, is the network able to tell whether or not it’s in training?
The training check has no reason to come into existence in the first place under gradient descent. Of course, if the AI were to self-modify while already exhibiting goal directed behaviour, obviously it would want to implement such a training check. But I am talking about an AI trained by gradient descent. The training process doesn’t just affect the AI, it literally is what creates the AI in the first place.