You’re saying we tell a nascent AGI it’s going to be shut down, to see if it tries to escape and kill us, before it’s likely to succeed?
Seems reasonable.
The downside is that being really mean to an AGI that will gain more power could foster revenge. This is the only way you can lose worse than everyone dying. This isn’t crazy under some realistic AGI designs.
If that’s what you’re saying, I find your method of saying it a bit confusing. Sure it’s important to demonstrate that you’re familiar with AI safety as a field, but that came across as excessively academic. One thing I love about working in AGI alignment is that the field values plain, clear communication more than other academic fields do.
You highlight a very important issue: S-Risk scenarios could emerge even in early AGI systems, particularly given the persuasive capabilities demonstrated by large language models.
While I don’t believe that gradient descent would ever manifest “vengefulness” or other emotional attributes—since these traits are products of natural selection—it is plausible that an AGI could employ highly convincing strategies. For instance, it might threaten to create a secondary AI with S-Risk as a terminal goal and send it to the moon, where it could assemble the resources it needs without interference.
This scenario underscores the limitations of relying solely on gradient descent for AGI control. However, I believe this technique could still be effective if the AGI is not yet advanced enough for self-recursive optimization and remains in a controlled environment.
Obviously this whole thing is a remedy than anything else...
You’re saying we tell a nascent AGI it’s going to be shut down, to see if it tries to escape and kill us, before it’s likely to succeed?
Seems reasonable.
The downside is that being really mean to an AGI that will gain more power could foster revenge. This is the only way you can lose worse than everyone dying. This isn’t crazy under some realistic AGI designs.
If that’s what you’re saying, I find your method of saying it a bit confusing. Sure it’s important to demonstrate that you’re familiar with AI safety as a field, but that came across as excessively academic. One thing I love about working in AGI alignment is that the field values plain, clear communication more than other academic fields do.
You highlight a very important issue: S-Risk scenarios could emerge even in early AGI systems, particularly given the persuasive capabilities demonstrated by large language models.
While I don’t believe that gradient descent would ever manifest “vengefulness” or other emotional attributes—since these traits are products of natural selection—it is plausible that an AGI could employ highly convincing strategies. For instance, it might threaten to create a secondary AI with S-Risk as a terminal goal and send it to the moon, where it could assemble the resources it needs without interference.
This scenario underscores the limitations of relying solely on gradient descent for AGI control. However, I believe this technique could still be effective if the AGI is not yet advanced enough for self-recursive optimization and remains in a controlled environment.
Obviously this whole thing is a remedy than anything else...