Do you think that the next-token prediction objective would lead to instrumentally convergent goals and/or power-seeking behavior? I could imagine an argument that a model would want to seize all available computation in order to better predict the next token. Perhaps the model’s reward schedule would never lead it to think about such paths to loss reduction, but if the model is creative enough to consider such a plan and powerful enough to execute it, it seems that many power-seeking plans would help achieve its goal. This is significantly different from the view advanced by OpenAI, that language models are tools which avoid some central dangers of RL agents, and from the general distinction drawn between tool AI and agentic AI.
> Do you think that the next-token prediction objective would lead to instrumentally convergent goals and/or power-seeking behavior?
No. Simulators aren’t (in general) agents. Language models were optimised for the task of next-token prediction, but they don’t necessarily optimise for it. I am not convinced that their selection pressure favoured agents over a more general cognitive architecture that can predict agents (and other kinds of systems).
Furthermore, insofar as they are actually optimisers for next-token prediction, it’s in a very myopic way. That is, I don’t think language models will take actions to make future tokens easier to predict.
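To spell out what “myopic” means here (my own gloss on the standard objective, not something specific to any particular model): the self-supervised loss is a sum of per-token terms, each conditioned only on the ground-truth prefix from the corpus, so no term rewards steering later tokens:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right),$$

where $x_{<t} = (x_1, \dots, x_{t-1})$ comes from the training data, never from the model’s own output. The model’s prediction at position $t$ has no influence on the context it is scored against at position $t+1$.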
> I don’t think language models will take actions to make future tokens easier to predict
For an analogy, look at recommender systems. Their objective is myopic in the same way a language model’s is: predict which recommendation will most likely result in a click. Yet they have power-seeking strategies available, such as shifting the preferences of a user to make their behavior easier to predict. These incentives are well documented, and simulations confirm the predictions here and here. The real-world evidence is scant: a study of YouTube’s supposed radicalization spiral came up negative, though the authors didn’t log in to YouTube, which could lead to less personalization of recommendations.
The jury is out on whether current recommender systems execute power-seeking strategies to improve their supposedly myopic objective. But the incentive and means are clearly present, and to me it seems only a matter of time before we observe this behavior in the wild. Similarly, while I don’t think current language models are creative or capable enough to execute a power-seeking strategy, it seems like power seeking by a superintelligent language model would be rewarded with lower loss. If a language model could use its outputs to persuade humans to train it with more compute on more data, thereby reducing its loss, there seems to be every incentive for the model to seek power in this way.
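To illustrate the dynamic (a toy sketch only; this is not a reconstruction of the cited simulations, and the greedy click-rate policy and preference-drift rule are assumptions invented for the example), here is a minimal simulation in which a recommender that greedily maximises estimated click rate, interacting with a user whose interests drift toward what they click on, ends up facing a more concentrated and therefore more predictable user:

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics = 5
n_steps = 2000
drift = 0.05  # toy assumption: how strongly a clicked recommendation shifts the user's interests

# The user starts out roughly indifferent between topics.
prefs = np.ones(n_topics) / n_topics

# The recommender's running click statistics per topic (with prior counts).
clicks = np.ones(n_topics)
shows = np.full(n_topics, 2.0)

entropies = []
for _ in range(n_steps):
    # Myopic policy: recommend the topic with the highest estimated click rate.
    topic = int(np.argmax(clicks / shows))

    # The user clicks with probability equal to their current interest in that topic.
    clicked = rng.random() < prefs[topic]
    shows[topic] += 1
    clicks[topic] += clicked

    # Feedback loop: a clicked recommendation nudges the user's preferences
    # toward that topic, concentrating their future behaviour.
    if clicked:
        prefs[topic] += drift
        prefs /= prefs.sum()

    # Entropy of the preference distribution: lower means more predictable.
    entropies.append(float(-(prefs * np.log(prefs)).sum()))

print(f"preference entropy at start: {entropies[0]:.3f}")
print(f"preference entropy at end:   {entropies[-1]:.3f}")
```

The entropy of the user’s preference distribution falls over the run, which is the “easier to predict” effect in miniature. Whether deployed systems actually exploit this loop is exactly the open empirical question above.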
As I understand it, GPT-3 and co. are trained via self-supervised learning with the goal of minimising predictive loss. During training, their actions/predictions do not influence their future observations in any way. The training process does not select for trying to control/alter the text input, because that is something impossible for the AI to accomplish during training.
As such, we shouldn’t expect the AI to demonstrate such behaviour. It was not selected for power seeking.
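Here is a minimal sketch of the kind of training step being described, assuming a PyTorch-style causal language model that returns logits (the `model` and `optimizer` here are placeholders, not any particular implementation). The targets are just the inputs shifted by one position, so the model’s own predictions are never fed back in as observations:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    # batch: LongTensor of token ids, shape (batch_size, seq_len),
    # drawn from a fixed corpus. Inputs are the first seq_len - 1 tokens;
    # targets are the same tokens shifted left by one position.
    inputs, targets = batch[:, :-1], batch[:, 1:]

    # Teacher forcing: the model only ever conditions on ground-truth
    # prefixes; nothing it predicts is appended to its own input.
    logits = model(inputs)  # (batch_size, seq_len - 1, vocab_size)

    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten all positions
        targets.reshape(-1),
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Whatever distribution the model outputs at one position has no effect on what it is shown next; the data stream is fixed in advance, which is why the selection argument above goes through.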