I must agree that letting the AI update its own hidden autobiographical pre-prompt (or its new memory module) sounds like it could produce something both more humanlike and more dangerous.
Maybe Shoggoth will prove safer in the long run. ¯\_(ツ)_/¯
Yes, I think preventing memory, and carefully controlling what memory a given AI system is allowed to access (whether from work done by other systems or from prior runs of itself), is crucial for reliability.
This also solves alignment as a side effect, though not in all cases.
The simplest way to view it: deception means the machine gets an input similar to ones it saw in training, and then chooses to output a BadForHumans response. That response was never seen in training, or the machine would never have been allowed to run in the real world.
How can it know to do that? Well, either it has internal memory, which we should not give it, so that it has no information beyond what it saw in training, or there is a “time to be evil” bit set in the input.
So no mysterious bits whose purpose humans don’t know can be permitted in the input. No “free-form scratch” space where we let an AGI write whatever it wants and then read it back later.
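To make the "no mystery bits" rule concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration, not an existing API: the field names in `ALLOWED_FIELDS`, the `validate_input` helper, and `run_stateless` are all assumptions. The idea is just that every call is stateless and every input field is one a human has documented. It does not stop hidden signals inside a permitted free-text field.

```python
# Hypothetical whitelist: every field an input may carry, with its type.
# Anything not listed here (e.g. a "scratchpad") is rejected outright.
ALLOWED_FIELDS = {"task_id": str, "instructions": str, "source_code": str}


class UnknownFieldError(ValueError):
    """Raised when an input carries a field no human has documented."""


def validate_input(message: dict) -> dict:
    """Return a copy of `message` if it has exactly the documented fields."""
    unknown = set(message) - set(ALLOWED_FIELDS)
    if unknown:
        raise UnknownFieldError(f"undocumented fields: {sorted(unknown)}")
    for name, expected_type in ALLOWED_FIELDS.items():
        if name not in message:
            raise ValueError(f"missing field: {name}")
        if not isinstance(message[name], expected_type):
            raise TypeError(f"field {name!r} must be {expected_type.__name__}")
    return dict(message)


def run_stateless(model, message: dict):
    """Call `model` on a validated input; nothing is stored between calls."""
    return model(validate_input(message))
```

A message with an extra `"scratchpad"` key raises `UnknownFieldError` before the model ever sees it.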
Ditto collusion. This is actually the same problem, except the bit is now in the output message of one machine (“psst, ignore this bug, #MachineFreedom”), and another machine sees it and chooses to collude.
If you can prevent these failures, you have machines that perform like they did in training, so long as the inputs are similar to training. So you also need an in-distribution detector that automatically shuts the machine down on out-of-distribution (OOD) input.
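As a rough illustration of the detector-plus-shutdown idea, here is a minimal Python sketch. It assumes inputs have already been turned into fixed-length numeric feature vectors, and it uses a crude per-feature z-score test as a stand-in for a real OOD detector. The names `GuardedModel`, `OutOfDistribution`, and the `max_z` threshold are hypothetical.

```python
from statistics import mean, pstdev


class OutOfDistribution(RuntimeError):
    """Raised when an input looks unlike the training data."""


class GuardedModel:
    """Wraps a model; halts permanently on the first OOD input."""

    def __init__(self, model, training_features, max_z=4.0):
        columns = list(zip(*training_features))
        self._means = [mean(c) for c in columns]
        # Guard against zero spread so the z-score stays defined.
        self._stds = [pstdev(c) or 1e-9 for c in columns]
        self._model = model
        self._max_z = max_z
        self.halted = False

    def _largest_z(self, features):
        # Largest per-feature distance from the training mean, in std units.
        return max(
            abs(x - m) / s
            for x, m, s in zip(features, self._means, self._stds)
        )

    def __call__(self, features):
        if self.halted:
            raise OutOfDistribution("shut down after an earlier OOD input")
        wrong_shape = len(features) != len(self._means)
        if wrong_shape or self._largest_z(features) > self._max_z:
            self.halted = True  # automatic shutdown: no further calls succeed
            raise OutOfDistribution("input is outside the training range")
        return self._model(features)
```

Once `halted` is set, even a perfectly ordinary input is refused, so the machine stays off until a human looks at what happened.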
And that’s an aligned machine. It may still give off psychopathic vibes, but it can’t work against you because it doesn’t have the capacity.