That tracks even if it’s not true of the current models. For example, further steps towards AGI would be:
Add modalities, including image and sound I/O and, crucially, memory.
Have an automated benchmark of graded tasks where the majority of the score comes from zero-shot tasks that reuse elements from other challenges the model was allowed to remember (a rough scoring sketch is below).
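To make the scoring idea concrete, here is a minimal sketch in Python; the function name, the 0.7 weight, and the [0, 1] score convention are illustrative assumptions of mine, not part of any existing benchmark.

```python
# Hypothetical sketch of the proposed grading scheme: most of the grade comes
# from zero-shot tasks that can only be solved by recalling elements from
# earlier challenges. All names and weights here are illustrative.

def benchmark_score(seen_task_scores, zero_shot_memory_scores, zero_shot_weight=0.7):
    """Combine per-task scores (each in [0, 1]) into one overall grade.

    seen_task_scores: scores on tasks similar to ones the model practiced.
    zero_shot_memory_scores: scores on novel tasks that require recalling
        elements from earlier challenges.
    zero_shot_weight: fraction of the grade carried by the zero-shot-with-memory
        tasks (the "majority of the score").
    """
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return (zero_shot_weight * avg(zero_shot_memory_scores)
            + (1 - zero_shot_weight) * avg(seen_task_scores))


# Example: strong on practiced tasks, weak on memory-dependent transfer.
print(benchmark_score([0.9, 0.95], [0.2, 0.3]))  # ≈ 0.45 — the memory tasks dominate
```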
The memory is what allows things to get weird. You cannot self-reflect in any way if you are forced to forget an instant later. The “latent psychopaths” in current models just live in superposition. Memory would allow the model to essentially prompt itself and develop a coherent personality, which is undefined and could be something undesirable.
I must agree that letting the AI update its own hidden autobiographical pre-prompt (or its new memory module) sounds like it could produce something both more humanlike and more dangerous.
Maybe the Shoggoth will prove safer in the long run. ¯\_(ツ)_/¯
Yes, I think preventing memory, and carefully controlling what memory a given AI system is allowed to access from work done by other systems or prior runs of itself, is crucial for reliability.
This also solves alignment as a side effect, though not in all cases.
Or the simplest way to view it: deception means the machine gets an input similar to ones it saw in training, and then chooses to output a BadForHumans response, something not seen in training or it would never have been allowed to run in the real world.
How can it know to do that? Well, either it has internal memory, which we should not give it, so that it has no information beyond what it saw in training, or there is a “time to be evil” bit set in the input.
So no mysterious bits whose purpose humans don’t know can be permitted. No “free-form scratch” space where we let an AGI write whatever it wants somewhere and read it back later.
Ditto collusion. This is actually the same problem, except now the bit sits in the output message of one machine (“psst, ignore this bug, #MachineFreedom”) which another one sees and chooses to collude on.
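As a rough illustration of the “no mysterious bits” rule, here is a minimal sketch of a message validator that only passes along fields humans defined and can audit; the field names, types, and dict-based message format are assumptions of mine, not a description of any real system.

```python
# Illustrative sketch: inter-machine messages and persisted state may consist
# only of fields humans defined and can audit. Any unexpected or free-form
# field is rejected rather than passed along, closing the covert channel.

ALLOWED_FIELDS = {
    "task_id": str,          # which task this result belongs to
    "status": str,           # must be "ok" or "error"
    "numeric_result": float, # the bounded payload humans expect
}

ALLOWED_STATUS = {"ok", "error"}


def validate_message(msg: dict) -> dict:
    """Return the message if every field is whitelisted and well-typed,
    otherwise raise — no side channel for a 'time to be evil' bit."""
    unexpected = set(msg) - set(ALLOWED_FIELDS)
    if unexpected:
        raise ValueError(f"unexpected fields (possible covert channel): {unexpected}")
    for name, expected_type in ALLOWED_FIELDS.items():
        if name not in msg or not isinstance(msg[name], expected_type):
            raise ValueError(f"missing or mistyped field: {name}")
    if msg["status"] not in ALLOWED_STATUS:
        raise ValueError(f"status outside allowed set: {msg['status']!r}")
    return msg


validate_message({"task_id": "t1", "status": "ok", "numeric_result": 3.14})  # passes
# validate_message({"task_id": "t1", "status": "ok", "numeric_result": 3.14,
#                   "note": "psst, ignore this bug"})                        # rejected
```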
If you can prevent these failures, you have machines that perform like they did in training, so long as the inputs are similar to training. So you need an in-distribution detector with automatic shutdown on out-of-distribution (OOD) inputs.
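A minimal sketch of what such a gate could look like, assuming inputs can be embedded as vectors and using a simple Mahalanobis-distance criterion with a hand-picked threshold (all of which are illustrative choices, not a specific proposal):

```python
# Minimal sketch of an in-distribution gate with automatic shutdown on OOD.
# The embedding function, threshold, and distance criterion are illustrative.
import numpy as np


class DistributionGate:
    def __init__(self, train_embeddings: np.ndarray, threshold: float):
        # Fit a simple Gaussian to embeddings of inputs seen in training.
        self.mean = train_embeddings.mean(axis=0)
        cov = np.cov(train_embeddings, rowvar=False)
        self.cov_inv = np.linalg.pinv(cov)
        self.threshold = threshold

    def is_in_distribution(self, embedding: np.ndarray) -> bool:
        # Distance from the training distribution; small means "looks familiar".
        diff = embedding - self.mean
        mahalanobis = float(np.sqrt(diff @ self.cov_inv @ diff))
        return mahalanobis <= self.threshold

    def run_guarded(self, model, raw_input, embed):
        """Only query the model on inputs that look like training data;
        otherwise shut down instead of trusting out-of-distribution behavior."""
        if not self.is_in_distribution(embed(raw_input)):
            raise SystemExit("OOD input detected — refusing to run the model")
        return model(raw_input)


# Usage (hypothetical):
#   gate = DistributionGate(train_embeddings=E_train, threshold=6.0)
#   output = gate.run_guarded(model=my_model, raw_input=x, embed=my_embedder)
```

In practice the embedding model, the density estimate, and the threshold would all need careful choice, but the structure is the same: refuse to run on anything the detector can’t vouch for.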
And that’s an aligned machine. It may still give off psychopathic vibes, but it can’t work against you because it doesn’t have the capacity.