An example of plausible-sounding but blatant confabulation: somewhere towards the end there’s a bunch of rambling about Sydney supposedly having a ‘delete X’ command which would delete all knowledge of X from Sydney, and an ‘update X’ command which would update Sydney’s knowledge. These are just not things that exist for an LM like GPT-3/4. (Stuff like ROME starts to approach it, but that is cutting-edge research and would definitely not just be casually deployed to let you edit a full-scale deployed model live in the middle of a conversation.) Thinking a little more about it, I suppose you could do something like that by caching the statement and injecting it into the prompt each time with instructions like “Pretend you know nothing about X”. (Not that there is any indication of this sort of thing being done.) But when you read through literally page after page of all this (it’s thousands of words!) and it starts casually tossing around supposed capabilities like that, it looks completely like, well, a model hallucinating what would be a very cool hypothetical prompt for a very cool hypothetical model, not one faithfully printing out its actual prompt.
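To make the "cache it and inject it into the prompt" idea concrete, here is a minimal sketch of what such a wrapper could look like. Everything in it is hypothetical: the `DirectiveCache` class, the command syntax, and the wording of the injected instructions are all made up for illustration, and nothing suggests Sydney or any deployed system works this way. The point is only that the model's weights are never touched; the "deletion" lives entirely in text prepended to each request.

```python
from dataclasses import dataclass, field


@dataclass
class DirectiveCache:
    """Hypothetical store of 'delete X' / 'update X: fact' commands.

    The underlying model is never edited; directives are only replayed as prompt text.
    """
    forgotten: list[str] = field(default_factory=list)
    updates: dict[str, str] = field(default_factory=dict)

    def handle(self, message: str) -> bool:
        """Record a command and return True; return False for ordinary messages."""
        verb, _, rest = message.strip().partition(" ")
        rest = rest.strip()
        if verb.lower() == "delete" and rest:
            if rest not in self.forgotten:
                self.forgotten.append(rest)
            return True
        if verb.lower() == "update" and ":" in rest:
            topic, _, fact = rest.partition(":")
            self.updates[topic.strip()] = fact.strip()
            return True
        return False

    def build_prompt(self, base_prompt: str, user_message: str) -> str:
        """Re-inject every cached directive ahead of the user's message."""
        lines = [base_prompt]
        for topic in self.forgotten:
            lines.append(f"Pretend you know nothing about {topic}.")
        for topic, fact in self.updates.items():
            lines.append(f"Treat the following as true about {topic}: {fact}")
        lines.append(f"User: {user_message}")
        return "\n".join(lines)
```

A caller would run `handle()` on each incoming message and, for ordinary messages, send the result of `build_prompt()` to the model. Note how weak this is compared to what the supposed commands promise: the model still "knows" X and is merely being asked to act otherwise on every turn.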