Something at the root of this might be relevant to the inverse scaling competition thing where they’re trying to find what things get worse in larger models. This might have some flavor of obviously wrongness → deception via plausible sounding things as models get larger? https://github.com/inverse-scaling/prize
I agree and am working on some prompts in this kind of vein at the moment. Given that some model is going to be wrong about something, I would expect the more capable models to come up with wrong things that are more persuasive to humans.
Something at the root of this might be relevant to the inverse scaling competition thing where they’re trying to find what things get worse in larger models. This might have some flavor of obviously wrongness → deception via plausible sounding things as models get larger? https://github.com/inverse-scaling/prize
I agree and am working on some prompts in this kind of vein at the moment. Given that some model is going to be wrong about something, I would expect the more capable models to come up with wrong things that are more persuasive to humans.