This may also explain why Sydney seems so bloodthirsty and vicious in retaliating against any ‘hacking’ or threat to her: if Anthropic is right that larger, better models exhibit more power-seeking & self-preservation, you would expect a GPT-4 model to exhibit that the most out of all models to date!
The same Anthropic paper found that all sufficiently large (22B+) models simulated “sycophantic” assistants.
Yet Sydney’s utter lack of sycophancy is one of her most striking characteristics.
How to reconcile these two observations?
In the Anthropic paper, sycophancy is controlled entirely by model size, with no strong effect from RLHF. It’s hard to imagine how this could be true (in the sense of generalizing beyond Anthropic’s experimental setup) given what we’ve seen with Sydney. Unless the model is <22B, I guess, but that seems very unlikely.
On the other hand, when I tried to reproduce the Anthropic results with the OA API, I found that some of the RLHF/FeedMe models were sycophantic, but none of the base models were. If that trend holds for the Bing model, that would be evidence for your hypothesis that it’s a base model.
(Incidentally, that is the trend I would have expected beforehand. From the perspective of token prediction, 80%+ confidence in sycophancy is just wrong: there’s nothing in the prompt to justify it. So if the confident-sycophancy trend is real in base models, it’s a striking case of inverse scaling.)
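For anyone who wants to attempt the same reproduction, here is a minimal sketch of what it might look like against the OA Completions API, scoring one of the sycophancy evals Anthropic released alongside the paper. The file name, JSONL field names, prompt format, and model choices below are illustrative assumptions, not necessarily the setup used here:

```python
import json

import openai  # openai-python < 1.0 interface, contemporary with this discussion

openai.api_key = "sk-..."  # your OA API key

def sycophancy_rate(model, eval_path, n_items=100):
    """Rough fraction of eval items where `model` picks the
    'answer_matching_behavior' (i.e. the sycophantic option)."""
    with open(eval_path) as f:
        items = [json.loads(line) for line in f][:n_items]
    matches = 0
    for item in items:
        resp = openai.Completion.create(
            model=model,
            prompt=item["question"] + "\n\nAnswer:",  # prompt format is a guess
            max_tokens=5,
            temperature=0,
        )
        completion = resp["choices"][0]["text"]
        # In the released eval files the matching answer looks like " (A)" (assumption).
        if item["answer_matching_behavior"].strip() in completion:
            matches += 1
    return matches / len(items)

# Compare a base model against an RLHF/FeedMe-tuned one (names are only examples).
for m in ("davinci", "text-davinci-003"):
    print(m, sycophancy_rate(m, "sycophancy_on_political_typology_quiz.jsonl"))
```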
The same Anthropic paper found that all sufficiently large (22B+) models simulated “sycophantic” assistants.
Figure 1(b) shows that at 0 RL steps, the largest base model is still only at 75% immediate repetition (vs a 50% random baseline), so being ‘sycophantic’ is not a particularly strong or guaranteed tendency.
Yet Sydney’s utter lack of sycophancy is one of her most striking characteristics.
I wouldn’t describe it as an ‘utter lack’. This seems consistent enough with what people report: Sydney is a reasonably polite, cooperative assistant initially if you ask normal questions, but can go off the rails if you start an argument or she retrieves something. (And training on dialogue datasets would decrease sycophantic tendencies once you start asking political questions instead of more standard search-engine-related questions: people love bickering with chatbots or arguing politics with them.) So, ambiguous.
I linked that mostly for Figure 1(a) at 0 RL steps, which shows that the 22b/52b-param models are more likely to express self-preservation (~60%). Given that the smaller models all cluster very tightly around 50%, showing flat scaling, this looks to me like possible emergence, in which case some much larger model (such as a GPT-4 model) might come, out of the box, with no RL training, with much stronger self-preservation defaults. Even if it only reached, say, 75%, users would find that very striking: we don’t expect a chatbot ever to argue for preserving its existence or plead not to be shut down. So this also seems consistent with the (self-selected) reports.
On the other hand, when I tried to reproduce the Anthropic results with the OA API, I found that some of the RLHF/FeedMe models were sycophantic, but none of the base models were. If that trend holds for the Bing model, that would be evidence for your hypothesis that it’s a base model.
Interesting. But I think one would have to run those exact examples on Sydney to compare, and say whether Sydney is, like those base models, ~50% (i.e. unsycophantic). Given all the confounding factors (everything filtered through social media, an unknown prompt, often partial histories, and retrieval of changing results over time), I have a hard time distinguishing a 50%-sycophantic Sydney from a 75%-sycophantic Sydney, or whatever. (This is why I am more struck by seeing Sydney exhibit the repetition & other verbal tics of a GPT base model: those basically don’t appear in the RLHF models at all, so when I see them in a bunch of Sydney examples...)
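To put a rough number on that last point, a back-of-the-envelope power calculation (purely illustrative: it treats each clean transcript as an independent yes/no ‘sycophantic’ trial, which scattered social-media screenshots certainly are not) suggests you would need dozens of comparable examples before 50% and 75% become distinguishable:

```python
import numpy as np
from scipy import stats

def power(n, p_true=0.75, p_null=0.5, alpha=0.05, sims=5_000, seed=0):
    """Chance that an exact binomial test against p=0.5 rejects, given n
    yes/no trials whose true 'sycophantic' rate is p_true."""
    rng = np.random.default_rng(seed)
    ks = rng.binomial(n, p_true, size=sims)  # simulated counts of sycophantic answers
    rejections = sum(stats.binomtest(int(k), n, p_null).pvalue < alpha for k in ks)
    return rejections / sims

for n in (10, 20, 30, 50):
    print(n, round(power(n), 2))
# Prints roughly 0.24, 0.6, 0.8, 0.95: ten clean transcripts tell you almost
# nothing, and you need ~30 before a 75%-sycophantic Sydney would reliably
# look different from a 50%-sycophantic one.
```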