Base model sycophancy feels very dependent on the training distribution and prompting. I'd guess there are some prompts where a pretrained model will almost always agree with other voices in the prompt and some where it will disagree, because the training data includes some websites where people mostly agree and others where they mostly disagree, and maybe even an effect where the model switches positions each turn to simulate an argument between two opposing sides.