I agree with most of what you say here, and I agree that the definition of sycophancy is a bit vague.
I think I agree that if the dumb models always tried to kill us, I’d be more afraid of future models (though the evidence isn’t very strong, both because of the gap between dumb and smarter models, and because for smarter models I don’t expect the open-web-text prior, which talks a lot about AI takeover, to matter as much as it does for dumber ones). But in the experiments described in this post (not your comment) there is a ton of spoonfeeding, and I care about the model’s “will” when there is no spoonfeeding. Experiments at spoonfeeding > 0 aren’t very informative.