Why do you think that the number of people who could make a convincing case to you is so low? Where do they normally mess up?
Not Rohin (who might disagree with me on what constitutes a “good” case), but I’ve also tried to do a similar experiment.
Besides the “why does RLHF not work” question, which is pretty tricky, another classic theme is people misciting the ML literature, or confidently citing outlier papers as if they were settled science. If you’re going to back up your claims with citations, it’s very important to get them right!
I’d encourage you to write up a blog post on common mistakes if you can find the time.
Because I ran the experiment and very few people passed. (The extrapolation from that to an estimate for the world is guesswork.)
There are a lot of different arguments people give, which I dislike for different reasons, but one somewhat common theme was that their argument was not robust to “it seems like InstructGPT is basically doing what its users want when it is capable of it, so why not expect a scaled-up InstructGPT to just continue doing what its users want?”
(And when I explicitly said something like that, they didn’t have a great response.)
Yeah… I suppose you could go through Evan Hubinger’s arguments in “How likely is deceptive alignment?”, but you’d probably have some further pushback which would be hard to answer.