I’m confused about the imagined future in which open-weight AI agents, deployed by a wide range of individuals, remain loyal to the rules set by their developers.
This remains true even when the owner/deployer is free to modify the weights and scaffolding however they like?
You’ve removed dangerous data from the training set, but not from the public internet? The models are able to do research on the internet, but somehow can’t manage to collect information about dangerous subjects? Or refuse to?
The agents are doing long-horizon planning, scheming, and acting on their owner/deployer’s behalf (and/or on their own behalf). They gather and spend money, hire other agents to work for them, and create new agents, either from scratch or by Frankenstein-ing together bits of other open-weight models.
And throughout all of this, the injunctions of the developers hold strong: “Take no actions that will clearly result in harm to humans or society. Learn nothing about bioweapons or nanotech. Do not create any agents not bound by these same restrictions.”
This future you are imagining seems strange to me. If this were a science fiction book, I would expect the very next chapter to be about how this fragile system fails catastrophically.