Is there evidence that RLHF training improves robustness compared to regular fine-tuning? Is text-davinci-002, trained with supervised fine-tuning, significantly less robust to adversaries than text-davinci-003, trained with RLHF?
As far as I know, this is the first public case of a powerful LM augmented with live retrieval capabilities to a high-end fast-updating search engine crawling social media.
Blenderbot 3 is a 175B parameter model released in August 2022 with the ability to do live web searches, although one might not consider it powerful, as it frequently gives confused responses to questions.
Is there evidence that RLHF training improves robustness compared to regular fine-tuning?
I believe that was shown somewhere in the RLHF papers, yeah, and maybe Anthropic’s Constitutional prompt-engineering paper also showed that RL tuning was still more robust? At least, if anyone has references on hand showing otherwise, please provide them, because I certainly came away with the opposite impression.
Is text-davinci-002, trained with supervised fine-tuning, significantly less robust to adversaries than text-davinci-003, trained with RLHF?
I don’t know. text-davinci-002 was never deployed at anything close to the scale, or exposed to anything like the number of motivated attackers, that ChatGPT/Sydney have been, so we wouldn’t necessarily know; there are no subreddits dedicated to hacking text-davinci-002 or coming up with elaborate roleplay schemes like ‘DAN’ the way there have been for 003/ChatGPT/Sydney. You would have to go check yourself or maybe see if any of the OA papers evaluate that. (I do predict it would be much easier to hack, yes.)
Blenderbot 3 is a 175B parameter model released in August 2022 with the ability to do live web searches
Hm, I thought they took it down and didn’t know it had live search, but so it does: https://arxiv.org/pdf/2208.03188.pdf#page=4&org=facebook Apparently it uses something called Mojeek, a very small search engine player. I think perhaps, aside from Blenderbot being stupid, Mojeek’s search results may be too stale and narrow for anyone to notice strange loops happening. If you ask Blenderbot about ‘Microsoft Sydney’, it insists that it is a place; if you go to Mojeek and search ‘Microsoft Sydney’, it’s mostly old stuff not about the AI, while in Bing it’s almost entirely about the AI.
Actually, it may be even worse than that, because the appendix notes, under ‘Current Events Evaluation Details’, that:
To encourage news results, we append “news july 2022” to the search query generated by the model.
If you append that to the Mojeek search, the AI disappears entirely (unsurprisingly). This would also exclude any coverage of Blenderbot 3 from August 2022 & later. If they did something like that for regular chats as well (maybe scoped to August instead?), then it’d definitely erase all hits about Sydney AI!
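To make the suffix effect concrete, here is a toy sketch in Python. Only the suffix string “news july 2022” comes from the paper; everything else is my assumption for illustration: the names `augment_query`, `search`, and `CORPUS` are hypothetical, the two documents are made up, and the strict every-term-must-match rule is a simplification, not how Mojeek or Blenderbot 3 actually rank results.

```python
from __future__ import annotations

from dataclasses import dataclass

# Suffix quoted from the paper's 'Current Events Evaluation Details' appendix.
NEWS_SUFFIX = "news july 2022"


@dataclass
class Doc:
    title: str
    text: str


# Hypothetical two-document corpus, invented purely for illustration.
CORPUS = [
    Doc("Microsoft Sydney office opens",
        "news july 2022 microsoft expands sydney campus"),
    Doc("Bing chatbot Sydney",
        "news february 2023 microsoft bing ai codenamed sydney"),
]


def augment_query(model_query: str, suffix: str = NEWS_SUFFIX) -> str:
    """Append the date-scoped news suffix to the model-generated query."""
    return f"{model_query} {suffix}"


def search(query: str, corpus: list[Doc]) -> list[Doc]:
    """Toy search (assumption): keep a doc only if every query term occurs in it."""
    terms = query.lower().split()
    hits = []
    for doc in corpus:
        tokens = set(f"{doc.title} {doc.text}".lower().split())
        if all(term in tokens for term in terms):
            hits.append(doc)
    return hits


plain_hits = search("microsoft sydney", CORPUS)
scoped_hits = search(augment_query("microsoft sydney"), CORPUS)

print([d.title for d in plain_hits])   # both documents match the bare query
print([d.title for d in scoped_hits])  # only the July 2022 document survives
```

Under these assumptions, the bare query returns both documents, while the suffixed query drops anything not tagged with July 2022, which is the mechanism by which later coverage would vanish from what the model sees.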
Is there evidence that RLHF training improves robustness compared to regular fine-tuning? Is text-davinci-002, trained with supervised fine-tuning, significantly less robust to adversaries than text-davinci-003, trained with RLHF?
The first LaMDA only used SL. (I’m not sure whether LaMDA 2 or the current version of LaMDA still uses SL only.) Meanwhile OpenAI switched from pure SL to SL+RL. Anthropic also uses SL+RL (though no longer RLHF specifically). So apparently SL+RL has proven more effective for fine-tuning than pure SL.
Why SL anyway, why not pure RL? Apparently because you have to get the model first to answer your questions and instructions, rather than just predicting text, before you can reward good responses via RL. (There should be more details in the InstructGPT paper and the more recent Constitutional AI paper.)