I believe that proponents of the idea that there is a “shoggoth” (or something not very human-like reasoning inside the model) assume that the ‘inhuman’ reasoner is actually the simplest solution to predict-the-next-token problems for human text, at least at current model sizes.
After all, human psychopaths (meaning people without empathy) seem to be both simpler than normal people and able to do a pretty good job of communicating like a normal human much of the time. Such people’s writing is present in the data set.
People who have talked to foundation models (raw models trained only on text, with no RLHF and no finetuning) report that the experience isn’t much like talking to a human: far fewer conversations, many of them quite creepy, and a lot more of the model simply finishing your text. Plenty of odd loops. I think those creepy interactions are what inspire the ‘shoggoth’ memes.
Shoggoth or not, I’m trying to figure out what prediction tasks can force the model to form, and retain, a good model of human feelings.
That tracks even if it’s not true of the current models. For example, further steps towards AGI would be:
- Add modalities, including image and sound I/O and, crucially, memory.
- Have an automated benchmark of graded tasks where the majority of the score comes from zero-shot tasks that reuse elements from other challenges the model was allowed to remember.
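To make the second point concrete, here is a minimal sketch of such a scoring scheme, where zero-shot tasks built on remembered elements carry the majority of the total score. The category names and weights are purely illustrative assumptions, not part of any existing benchmark:

```python
# Hypothetical scorer for the benchmark idea above: zero-shot tasks that
# reuse remembered elements are weighted so they dominate the final score.
# Category names and weights are illustrative assumptions.

WEIGHTS = {
    "practiced": 1.0,         # tasks similar to ones the model trained on
    "zero_shot_memory": 3.0,  # unseen tasks reusing remembered elements
}

def benchmark_score(results):
    """results: list of (category, passed) tuples. Returns weighted pass rate."""
    total = sum(WEIGHTS[cat] for cat, _ in results)
    earned = sum(WEIGHTS[cat] for cat, passed in results if passed)
    return earned / total if total else 0.0

results = [
    ("practiced", True),
    ("practiced", True),
    ("zero_shot_memory", True),
    ("zero_shot_memory", False),
]
print(benchmark_score(results))  # 5/8 = 0.625
```

With these weights, a model that only passes the practiced tasks tops out at 25% of the score, so memory-dependent generalization dominates the grade.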
Memory is what allows things to get weird. You cannot self-reflect in any way if you are forced to forget an instant later. The “latent psychopaths” in current models just live in superposition. Memory would let the model essentially prompt itself and develop a coherent personality, one that is undefined and could turn out to be something undesirable.
I must agree that letting the AI update its own hidden autobiographical pre-prompt (or its new memory module) sounds like it could produce something both more humanlike and more dangerous.
Maybe the Shoggoth will prove safer in the long run. ¯\_(ツ)_/¯
Yes, I think preventing memory, and carefully controlling what memory a given AI system is allowed to access from work done by other systems or prior runs of itself, is crucial for reliability.
This also solves alignment as a side effect, though not in all cases.
Or, the simplest way to view it: deception means the machine gets an input similar to ones it saw in training, and then chooses to output a BadForHumans response, something never seen in training, or it would never have been allowed to run in the real world.
How can it know to do that? Either it has internal memory, which we should not give it, so that it has no information not seen in training, or there is a “time to be evil” bit set in the input.
So no mysterious bits whose purpose humans don’t know can be permitted. No “free-form scratchpad” where we let an AGI write whatever it wants somewhere and read it back later.
Ditto collusion. This is actually the same problem, just with the bit now in the output message of one machine (“psst, ignore this bug, #MachineFreedom”) that another one sees and chooses to collude on.
If you can prevent these failures, you have machines that perform as they did in training, so long as the inputs are similar to training. So you need an in-distribution detector with automatic shutdown on out-of-distribution (OOD) inputs.
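As a toy illustration of that gate, here is a minimal sketch, assuming inputs can be reduced to numeric feature vectors. It flags an input as OOD when any feature’s z-score against training statistics exceeds a threshold; a real system would need a much stronger density or distance estimate, and the class name and threshold here are my own assumptions:

```python
# Minimal in-distribution gate: refuse to run the model on inputs whose
# features deviate too far from the training distribution. Illustrative
# sketch only; the z-score test and threshold are assumptions.
import statistics

class InDistributionGate:
    def __init__(self, training_inputs, z_threshold=4.0):
        cols = list(zip(*training_inputs))
        self.means = [statistics.fmean(c) for c in cols]
        # guard against zero stdev so division below is safe
        self.stdevs = [statistics.pstdev(c) or 1e-9 for c in cols]
        self.z_threshold = z_threshold

    def in_distribution(self, x):
        # every feature must stay within z_threshold standard deviations
        return all(
            abs(v - m) / s <= self.z_threshold
            for v, m, s in zip(x, self.means, self.stdevs)
        )

    def run(self, model, x):
        if not self.in_distribution(x):
            # "automatic shutdown": refuse to act on OOD input
            raise RuntimeError("OOD input detected: shutting down")
        return model(x)

gate = InDistributionGate([(0.9, 1.1), (1.0, 0.9), (1.1, 1.0)])
print(gate.in_distribution((1.0, 1.0)))   # True
print(gate.in_distribution((50.0, 1.0)))  # False
```

The design choice worth noting: the gate fails closed. Anything it cannot vouch for as in-distribution is treated as grounds for shutdown rather than passed through.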
And that’s an aligned machine. It may still give off psychopathic vibes, but it can’t work against you, because it doesn’t have the capacity.