I don’t think it’s right to say that Anthropic’s “Discovering Language Model Behaviors with Model-Written Evaluations” paper shows that larger LLMs necessarily exhibit more power-seeking and self-preservation. It only showed that language models which are larger or have had more RLHF training exhibit more of these behaviours when they are simulating an “Assistant” character.
More specifically, an “Assistant” character that is trained to be helpful but not necessarily harmless. Given that, as part of Sydney’s defenses against adversarial prompting, Sydney is deliberately trained to be a bit aggressive towards people perceived as attempting a prompt injection, it’s not too surprising that this behavior misgeneralizes in undesired contexts.
Given that, as part of Sydney’s defenses against adversarial prompting, Sydney is deliberately trained to be a bit aggressive towards people perceived as attempting a prompt injection
Why do you think that? We don’t know how Sydney was ‘deliberately trained’.
Or are you referring to those long confabulated ‘prompt leaks’? We don’t know what part of them is real, unlike the ChatGPT prompt leaks, which were short, plausible, and verifiable by the changing date; and it’s pretty obvious that a large part of those ‘leaks’ is fake, because they refer to capabilities Sydney could not have, like model-editing capabilities at or beyond the cutting edge of research.
Can you give concrete examples?
An example of plausible-sounding but blatant confabulation: somewhere towards the end there’s a bunch of rambling about Sydney supposedly having a ‘delete X’ command which would delete all knowledge of X from Sydney, and an ‘update X’ command which would update Sydney’s knowledge. These are just not things that exist for an LM like GPT-3/4. (Stuff like ROME starts to approach it, but that is cutting-edge research and would definitely not just be casually deployed to let you edit a full-scale deployed model live in the middle of a conversation.) Thinking a little more about it, I suppose you could fake something like that by caching the statement and injecting it into the prompt each time with instructions like “Pretend you know nothing about X”. (Not that there is any indication of this sort of thing being done.)

But when you read through literally page after page of all this (it’s thousands of words!) and it starts casually tossing around supposed capabilities like that, it looks exactly like a model hallucinating what would be a very cool hypothetical prompt for a very cool hypothetical model, not faithfully printing out its actual prompt.
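To spell out that hypothetical workaround: here is a minimal sketch of a front-end wrapper that fakes a ‘delete X’ command purely by prompt injection, with no model editing at all. Everything in it is made up for illustration (the `handle_user_message` and `call_model` functions, the instruction wording); it is not a claim about how Bing actually works.

```python
# Hypothetical sketch only: faking a "delete X" command with prompt injection
# instead of real model editing. call_model() is a stand-in stub, not any
# actual Bing/OpenAI API.

deleted_topics: list[str] = []  # topics the user has asked the bot to "forget"


def call_model(prompt: str) -> str:
    """Stub standing in for whatever completion API the front-end calls."""
    return f"[model completion for a prompt of {len(prompt)} characters]"


def handle_user_message(message: str, history: str) -> str:
    # Intercept the pseudo-command in the front-end, before the LM ever sees it.
    if message.lower().startswith("delete "):
        topic = message[len("delete "):].strip()
        deleted_topics.append(topic)
        return f"OK, I now know nothing about {topic}."

    # The model itself is unchanged, so the "forgetting" has to be re-injected
    # into every single prompt as an instruction.
    preamble = "".join(
        f"Pretend you know nothing about {t} and refuse to discuss it.\n"
        for t in deleted_topics
    )
    prompt = f"{preamble}{history}\nUser: {message}\nAssistant:"
    return call_model(prompt)
```

Even in that best case, nothing about the model’s weights or knowledge changes; the topic is only hidden at the prompt level, and again, there is no indication anything like this was actually deployed.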
Here’s another source: an MS Bing engineer/manager admonishing Karpathy for uncritically posting the ‘leaked prompt’ and saying it’s not ‘the real prompt’ (emphasis added).