Maybe a bit of a nitpick, but RLHF'd GPT-4o can still detect Eric Drexler's writing (chat link). I gave it the first paragraph of his latest blog post, written in February 2024, past 4o's knowledge cutoff of October 2023. In general I'm not sure whether RLHF actually makes models worse at truesight. It would be interesting to see a benchmark comparing, e.g., Llama base vs. instruct on this capability.
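For concreteness, here's a minimal sketch of what that comparison might look like with Hugging Face transformers. The model names, prompt framings, and the toy example list are all assumptions for illustration, not a tested harness:

```python
# Sketch: compare a base model vs. its instruct variant on author identification.
# Model names and the example set below are placeholders, not a real benchmark.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

samples = [
    # (text excerpt, true author) -- hypothetical entries; real data should be
    # post-cutoff text so the model can't have memorized the attribution.
    ("Paragraph of post-cutoff text by a known author...", "Eric Drexler"),
]

def base_model_guess(model, tokenizer, text):
    # Completion framing: let the base model continue a byline-style prompt.
    prompt = f"{text}\n\nThe author of the passage above is"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

def instruct_model_guess(model, tokenizer, text):
    # Chat framing: ask the instruct model directly.
    messages = [{"role": "user",
                 "content": f"Who wrote this?\n\n{text}\n\nAnswer with a name only."}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(input_ids, max_new_tokens=8, do_sample=False)
    return tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)

for name, guess_fn in [("meta-llama/Llama-3.1-8B", base_model_guess),
                       ("meta-llama/Llama-3.1-8B-Instruct", instruct_model_guess)]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # Count an answer as correct if the true author's name appears in the guess.
    correct = sum(author.lower() in guess_fn(model, tokenizer, text).lower()
                  for text, author in samples)
    print(f"{name}: {correct}/{len(samples)}")
```

Note the two framings are deliberately different: the base model is scored on completing a byline, the instruct model on answering a question, which is arguably the fairest version of each task.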
Interesting, thanks! I do think RLHF makes the models worse at truesight[1]; see the examples in other comments. But the gap isn't so large that chat models can't eventually close it; what matters is how well you can predict future capabilities. Still, I would bet the gap remains big enough that GPT-4o would fail on the majority of examples that GPT-4-base (now two years old) gets right.
I say "worse" for lack of a better concise word. I think what's really happening is that we're measuring two slightly different tasks: how well a completion model understands the generator of the text it needs to produce, versus how well a chat assistant can recognize the author of a given text.