Is there evidence that RLHF training improves robustness compared to regular fine-tuning?
I believe that was shown somewhere in the RLHF papers, yeah, and maybe Anthropic’s Constitutional prompt-engineering paper also showed that RL tuning was still more robust? At least, if anyone has references on hand showing otherwise, please provide them because I certainly came away with the opposite impression.
Is text-davinci-002, trained with supervised fine-tuning, significantly less robust to adversaries than text-davinci-003, trained with RLHF?
I don’t know. text-davinci-002 was not deployed much, and was never exposed to motivated attackers at remotely the scale that ChatGPT/Sydney have been, so we wouldn’t necessarily know; there are no subreddits dedicated to hacking text-davinci-002 or coming up with elaborate roleplay schemes like ‘DAN’ the way there have been for 003/ChatGPT/Sydney. You would have to go check yourself or maybe see if any of the OA papers evaluate that. (I do predict it would be much easier to hack, yes.)
Blenderbot 3 is a 175B parameter model released in August 2022 with the ability to do live web searches
Hm, I thought they took it down and didn’t know it had live search, but so it does: https://arxiv.org/pdf/2208.03188.pdf#page=4&org=facebook Apparently it uses something called Mojeek, a very small search engine player. I think perhaps aside from Blenderbot being stupid, Mojeek’s search results may be too stale and narrow for anyone to notice strange loops happening. If you ask Blenderbot about ‘Microsoft Sydney’ it’s insistent about it being a place; if you go to Mojeek and search ‘Microsoft Sydney’, it’s mostly old stuff not about the AI, while in Bing it’s almost entirely about the AI.
Actually, it may be even worse than that, because the appendix notes of the ‘Current Events Evaluation Details’ that:
To encourage news results, we append “news july 2022” to the search query generated by the model.
If you append that to the Mojeek search, the AI disappears entirely (unsurprisingly). This would also exclude any coverage of Blenderbot 3 from August 2022 & later. If they did something like that for regular chats as well (maybe scoped to August instead?), then it’d definitely erase all hits about Sydney AI!
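To make the suffix effect concrete, here is a toy sketch of what appending a fixed string to every query can do to ranking. Only the suffix string comes from the paper's appendix; the function names, the term-overlap scorer, and the two documents are all hypothetical stand-ins, and real search engines like Mojeek rank in far more complicated ways.

```python
NEWS_SUFFIX = "news july 2022"  # the string quoted from the appendix


def build_query(model_query: str, suffix: str = NEWS_SUFFIX) -> str:
    """Append a fixed suffix to the model-generated search query."""
    return f"{model_query.strip()} {suffix}"


def overlap_score(query: str, document: str) -> int:
    """Toy ranking: count distinct query terms that appear in the document."""
    doc_terms = set(document.lower().split())
    return sum(1 for term in set(query.lower().split()) if term in doc_terms)


# Hypothetical documents, invented purely for illustration.
DOCS = {
    "old_office": "microsoft sydney office opens news july 2022",
    "new_ai": "microsoft sydney ai chatbot bing february 2023",
}


def rank(query: str) -> list[str]:
    """Return document names sorted by descending toy score."""
    return sorted(DOCS, key=lambda name: overlap_score(query, DOCS[name]), reverse=True)


plain = "Microsoft Sydney"
suffixed = build_query(plain)
print(suffixed)
print({name: overlap_score(plain, text) for name, text in DOCS.items()})
print({name: overlap_score(suffixed, text) for name, text in DOCS.items()})
print(rank(suffixed))
```

Under this toy scorer the two documents tie on the plain query, but once the suffix is appended, the page that happens to mention "news july 2022" matches every query term and the later AI page does not gain anything, which is the kind of skew toward stale hits speculated about above.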