My understanding is that these are explicitly and intentionally trained (wouldn’t come to exist naturally under gradient descent on normal training data)
No. Normally trained networks have adversarial examples. A sort of training process is used to find the adversarial examples.
So if the ambient rate of adversarial examples is 10^-9, then every now and then the AI will hit such an example and go wild. If the ambient rate is 10^-500, it won’t.
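To make the "sort of training process" concrete: adversarial examples are usually found by gradient-based search in input space rather than weight space. Below is a minimal sketch of one standard method (FGSM), assuming a PyTorch-style setup; `model`, `x`, `y`, and `loss_fn` are placeholder names, not anything from this discussion.

```python
import torch

def fgsm_adversarial_example(model, x, y, loss_fn, epsilon=0.01):
    """Fast Gradient Sign Method: one gradient step in *input* space.

    The search mirrors training, except the loss is maximized with respect
    to the input rather than minimized with respect to the weights.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    # Nudge the input in the direction that increases the loss, keeping the
    # perturbation small so the input still looks ordinary.
    return (x_adv + epsilon * x_adv.grad.sign()).detach()
```

On this picture, the "ambient rate" above is roughly the probability that an ordinary, un-optimized input happens to land on such a point by chance.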
That’s a much more complicated goal than the goal of correctly predicting the next token,
Is it more complicated? What ontological framework is this AI using to represent its goal anyway?
any willingness to sacrifice a few tokens now would be trained out by gradient descent.
Only if, during training, the network repeatedly gets into a state where it believes that sacrificing tokens now is a good idea, despite the fact that it isn’t a good idea when you are in training. (Unless there is a training environment bug and you can sneak out midway through training.)
So, is the network able to tell whether or not it’s in training?
No. Normally trained networks have adversarial examples. A sort of training process is used to find the adversarial examples.
I should have asked for clarification on what you meant. You literally said “adversarial examples”, but I assumed you actually meant something like backdoors.
In an adversarial example the AI produces wrong output, and usually that’s the end of it. The output is just wrong, but not wrong in an optimized way, so not dangerous. Now, if an AI is sophisticated enough to have some kind of optimizer that’s triggered in specific circumstances, like an agentic mask that came into existence because it was needed to predict agentically generated tokens in the training data, then that optimizer might be triggered inappropriately by some inputs. I would classify this case as a mask takeover.
In the case of direct optimization for token prediction (which I consider highly unlikely for anything near current-level AIs, but which afaik might be possible), adversarial examples might, I suppose, cause it to do some wrong optimization. I still don’t think modeling this as an underlying different goal taking over is particularly helpful, since the “normal” goal is directed at what’s rewarded in training; the deviation is essentially random. Also, unlike in the mask case, where the mask might have goals about real-world state, there’s no particular reason for the direct optimizer to have goals about real-world state (see below).
Is it more complicated? What ontological framework is this AI using to represent its goal anyway?
Asking what “ontological framework” the AI uses to “represent” a goal is not the right question, in my view. The AI is a bunch of computations represented by particular weights, and that computation might exhibit goal-directed behaviour. A better question, IMO, is “how much does it constrain the weights for the AI to exhibit this particular goal-directed behaviour?” And here, I think it’s pretty clear that a goal of arranging the world to cause next tokens to be predicted constrains the weights enormously more than a goal of predicting the next tokens: to exhibit behaviour directed at the former goal, the AI’s weights need to implement computation that doesn’t merely check what the next token is likely to be, but also assesses what the current data says about the world state, how different next-token predictions would affect that world state, and how that would affect its ultimate goal.
So, is the network able to tell whether or not it’s in training?
The training check has no reason to come into existence in the first place under gradient descent. Of course, if the AI were to self-modify while already exhibiting goal-directed behaviour, it would obviously want to implement such a training check. But I am talking about an AI trained by gradient descent. The training process doesn’t just affect the AI; it literally is what creates the AI in the first place.
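As a toy illustration of that last point (a minimal sketch, assuming PyTorch, with made-up sizes and a synthetic "next token" rule):

```python
import torch
import torch.nn as nn

# Toy next-token trainer: tiny made-up sizes, synthetic data.
vocab, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Flatten(), nn.Linear(dim, vocab))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(1000):
    context = torch.randint(0, vocab, (16, 1))  # stand-in training contexts
    target = (context.squeeze(1) + 1) % vocab   # stand-in "next token"
    loss = loss_fn(model(context), target)      # prediction loss, nothing else
    opt.zero_grad()
    loss.backward()                             # gradients come only from that loss
    opt.step()                                  # these updates are what build the model
```

Every weight in the final model is the cumulative result of these updates, and each update is driven solely by the prediction loss on the training data; nothing in the loop supplies a gradient toward an "am I in training?" check.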