Maybe one upside to the influx of “agents made with GPT-N API calls and software glue” is that these types of AI agents are more likely to cause a fire alarm-y disaster which gets mitigated, thus spurring governments to take X-risk more seriously. Other types of AI agents, by contrast, might have a first disaster that blows right past fire alarm level straight to world-ending level.
For example, I think this situation is plausible: ~AutoGPT-N[1] hacks into a supercomputer cluster or social-engineers IT workers over email or whatever in the pursuit of some other goal, but ultimately gets shut down by OpenAI simply banning the agent from using their API. Maybe it even succeeds in some scarier instrumental goal, like obtaining more API keys and spawning multiple instances of itself. However, the crucial detail is that the main “cognitive engine” of the agent is bottlenecked by API calls, so for the agent to wipe everyone out, it needs to overcome the hurdle of pwning OpenAI specifically.
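To make the “bottlenecked by API calls” point concrete, here is a minimal sketch (my own illustration, not actual AutoGPT code; the function names and ban mechanism are assumptions) of why an API-glue agent has a built-in stop button: every step of its cognition is one API call, so revoking the key halts the loop.

```python
# Toy model of an API-glue agent. The provider's key check acts as the
# "stop button": once the key is banned, the agent has no local model
# to fall back on and its cognition simply stops.

class APIKeyRevoked(Exception):
    """Raised when the provider has banned this key."""

def call_llm(prompt, key, revoked_keys):
    # Stand-in for the real API call; the provider validates the key first.
    if key in revoked_keys:
        raise APIKeyRevoked(key)
    return f"next action for: {prompt}"  # placeholder completion

def agent_loop(goal, key, revoked_keys, max_steps=10):
    steps = []
    for _ in range(max_steps):
        try:
            steps.append(call_llm(goal, key, revoked_keys))
        except APIKeyRevoked:
            break  # cognition halts here; the agent cannot think locally
        revoked_keys.add(key)  # simulate the provider banning the key mid-run
    return steps

# The agent completes exactly one step before the ban takes effect.
print(len(agent_loop("obtain compute", "sk-demo", set())))
```

An agent wrapped around an open-source model run locally has no analogue of `APIKeyRevoked`, which is the contrast drawn below.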
By contrast, if an agent that’s powered by an open-source language model gets to the “scary fire alarm” level of self-improvement/power-seeking, it might be too late, since it wouldn’t have a “stop button” controlled by a single corporation the way ~AutoGPT-N does. It could continue spinning up instances of itself while staying under the radar.
This isn’t to say that ~AutoGPT-N doesn’t pose any X-risk at all, but rather that it seems like it could cause the kind of disaster which doesn’t literally kill everyone but which is scary enough that the public freaks out and nations form treaties banning larger models from being trained, et cetera.
I’d like to make it very clear that I do not think it is a good thing that this type of agent might cause a disaster. Rather, I think it’s good that the first major disaster these agents will cause seems likely to be non-existential.
AutoGPT-N hacks into a supercomputer cluster or social-engineers IT workers over email or whatever in the pursuit of some other goal, but ultimately gets shut down by OpenAI simply banning the agent from using their API.
Could we even identify who did it to know that it was an instance of AutoGPT?
This was my first thought on seeing AutoGPT. I wrote about this in AI scares and changing public beliefs. But my second thought was that this is much more important. Not only might it work very well, but it also has immense advantages for initial alignment and corrigibility. This is potentially really good news.
[1] Future iteration of AutoGPT or a similar project