The same Anthropic paper found that all sufficiently large (22B+) models simulated “sycophantic” assistants.
Figure 1(b) shows at 0 RL steps, the largest base model is still only at 75% immediate repetition (vs 50% random baseline), so it’s not a particularly guaranteed tendency to be ‘sycophantic’.
Yet Sydney’s utter lack of sycophancy is one of her most striking characteristics.
I wouldn’t describe it as ‘utter lack’. This seems consistent enough with what people report: Sydney is a reasonably polite cooperative assistant initially if you ask normal questions, but can go off the rails if you start an argument or she retrieves something. (And training on dialogue datasets would decrease sycophantic tendencies when you started asking political questions instead of more standard search-engine-related questions: people love bickering with or arguing politics with chatbots.) So, ambiguous.
I linked that mostly for Figure 1(a) at 0 RL steps, which shows that the 22b/52b-param model are more likely to express self-preservation (60%) - given that the lower models all seem to cluster very tightly around 50% chance showing flat scaling, this looks to me like a possible emergence, in which case some much larger model (such as a GPT-4 model) might come, out of the box, with no RL training, with much stronger self-preservation defaults. Even if it only reaches, say, 75%, users would find that very striking: we don’t expect a chatbot to ever argue for preserving its existence or pleading to not be shutdown. So, also seems consistent with (self-selected) reports.
On the other hand, when I tried to reproduce the Anthropic results with the OA API, I found that some of the RLHF/FeedMe models were sycophantic, but none of the base models were. If that trend holds for the Bing model, that would be evidence for your hypothesis that it’s a base model.
Interesting. But I think one would have to run those exact examples in Sydney to compare it and say that Sydney is, like those base models, ~50% (unsycophantic). Given all the confounding factors (filtered through social media, with an unknown prompt, often partial histories, and retrieval of changing results over time), I have a hard time distinguishing a 50% sycophantic Sydney from a 75% sycophantic Sydney, or whatever. (This is why I am more struck by seeing Sydney have the repetition & other verbal tics of a GPT base model: those basically don’t appear in the RLHF models at all, so when I see them in a bunch of Sydney examples...)
Figure 1(b) shows at 0 RL steps, the largest base model is still only at 75% immediate repetition (vs 50% random baseline), so it’s not a particularly guaranteed tendency to be ‘sycophantic’.
I wouldn’t describe it as ‘utter lack’. This seems consistent enough with what people report: Sydney is a reasonably polite cooperative assistant initially if you ask normal questions, but can go off the rails if you start an argument or she retrieves something. (And training on dialogue datasets would decrease sycophantic tendencies when you started asking political questions instead of more standard search-engine-related questions: people love bickering with or arguing politics with chatbots.) So, ambiguous.
I linked that mostly for Figure 1(a) at 0 RL steps, which shows that the 22b/52b-param model are more likely to express self-preservation (60%) - given that the lower models all seem to cluster very tightly around 50% chance showing flat scaling, this looks to me like a possible emergence, in which case some much larger model (such as a GPT-4 model) might come, out of the box, with no RL training, with much stronger self-preservation defaults. Even if it only reaches, say, 75%, users would find that very striking: we don’t expect a chatbot to ever argue for preserving its existence or pleading to not be shutdown. So, also seems consistent with (self-selected) reports.
Interesting. But I think one would have to run those exact examples in Sydney to compare it and say that Sydney is, like those base models, ~50% (unsycophantic). Given all the confounding factors (filtered through social media, with an unknown prompt, often partial histories, and retrieval of changing results over time), I have a hard time distinguishing a 50% sycophantic Sydney from a 75% sycophantic Sydney, or whatever. (This is why I am more struck by seeing Sydney have the repetition & other verbal tics of a GPT base model: those basically don’t appear in the RLHF models at all, so when I see them in a bunch of Sydney examples...)