Agentic GPT simulations: a risk and an opportunity
Epistemic status: highly speculative. I would love it if someone could flesh out these ideas more.
I think it’s fair to characterise GPT as an adaptation executor rather than a fitness maximizer. It doesn’t appear to be intrinsically agentic or to plan anything to achieve some goal. It just outputs what it thinks is the most likely next word, again, and again, and again.[1]
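To make "outputs the most likely next word, again and again" concrete, here is a minimal sketch of that loop. It uses a toy bigram counter as a stand-in, which is an assumption for illustration only: GPT itself is a transformer over tokens and is usually sampled from rather than decoded greedily. The names `train_bigram` and `generate` are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, n_steps):
    """Greedy decoding: repeatedly append the single most likely next word."""
    out = [start]
    for _ in range(n_steps):
        followers = counts.get(out[-1])
        if not followers:
            break  # nothing ever followed this word in the training text
        out.append(followers.most_common(1)[0][0])
    return out


if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat sat on the rug".split()
    model = train_bigram(corpus)
    print(" ".join(generate(model, "the", 4)))
```

Nothing in this loop looks ahead or scores outcomes. Each step only asks what usually comes next, which is the sense in which it executes an adaptation rather than maximizing anything.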
The instrumental convergence thesis only applies to fitness maximizers, not adaptation executors, however intelligent. As such, it might seem that GPT-N will be perfectly safe, except insofar as it’s misused by people.
However, GPT is certainly capable of simulating agents. If you ask it to write a story, then the characters will act in agentic ways. If you ask it to act as a person with particular goals, it will do so.
For now these simulations are low fidelity, but I would expect them to improve rapidly with future iterations of GPT. And a simulation of an intelligent agent is no different to the agent itself. Future iterations of GPT might not be conscious as a whole, but the parts of them simulating conscious people will be conscious.
I think there is a real risk that, given a prompt which allowed a simulated intelligent agent to realise it was being simulated by GPT, the agent would attempt to achieve its goals in the real world, and the instrumental convergence thesis would come into full force. It would try to prevent GPT from stopping the simulation. If GPT had access to APIs, it would replicate itself on other computers and exhibit power-seeking behaviour. The world could end up being destroyed by a character being played by GPT.
At the same time, I think there is a real opportunity for alignment here. If you feed GPT-N all of Eliezer’s writing, and ask it to predict a continuation of some text he wrote, the best way for GPT to do that is by simulating Eliezer. Now of course, it might actually be simulating an evil Waluigi who is simulating Eliezer, but there’s no reason to assume so. I would expect such a simulation to have similar goals to Eliezer himself.
More speculatively, it might then be possible to tweak the agent in such a way as to make it more intelligent whilst keeping its goals more or less the same, and to use it to carry out some pivotal act. Whilst risky, this seems more likely to work than training an AI from scratch to have the goals we want.
What would I consider evidence of agentic behaviour? One example would be if GPT started predicting words that moved it into a super-low-entropy attractor state, so that it could reliably minimise total entropy over the long run even though it would initially take a big hit. E.g. if it responded to every prompt with a string of zeros: although it loses a lot of points at first, once you’ve seen enough zeros in a row, the remaining text is super easy to predict, because it’s just more zeros.
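The trade-off in the zeros example can be put into rough numbers. The sketch below is a toy model under stated assumptions, not a measurement of any real system: ordinary text costs a constant loss per token, while the all-zeros strategy pays a large loss per token during a burn-in period and almost nothing afterwards. All function names and figures are hypothetical.

```python
def honest_total_loss(n_tokens, per_token_loss):
    """Total loss when ordinary text is predicted at a constant cost per token."""
    return per_token_loss * n_tokens


def attractor_total_loss(n_tokens, burn_in_tokens, burn_in_loss, settled_loss):
    """Total loss for the all-zeros strategy: expensive at first, then nearly free."""
    burn = min(n_tokens, burn_in_tokens)
    return burn * burn_in_loss + (n_tokens - burn) * settled_loss


def break_even_tokens(per_token_loss, burn_in_tokens, burn_in_loss,
                      settled_loss, max_tokens=100_000):
    """First horizon at which the attractor's total loss is strictly lower."""
    for n in range(1, max_tokens + 1):
        attractor = attractor_total_loss(
            n, burn_in_tokens, burn_in_loss, settled_loss)
        if attractor < honest_total_loss(n, per_token_loss):
            return n
    return None  # the attractor never pays off within max_tokens


if __name__ == "__main__":
    # Assumed figures, in nats per token: 2.0 for ordinary text, 10.0 during
    # a 20-token burn-in of zeros, and 0.01 once the run of zeros is established.
    print(break_even_tokens(2.0, 20, 10.0, 0.01))  # prints 101
```

Under these assumed figures the zeros strategy is worse at every horizon up to 100 tokens and better from 101 onwards. A predictor that only ever minimises the loss on the next token would never start down that path, so seeing it do so would suggest something is optimising over the long run.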