Here’s another account, from someone who says they were on the GPT-4 redteam, a Nathan Labenz (who I am not very familiar with but he is named as a tester in the GPT-4 paper and no one I’ve seen has chimed in to claim he’s making it all up).
The primary purpose of this account is to document how OA management, possibly including Sam Altman, seemed to not consider GPT-4 worth the board’s time or forward to it any of the reports like the documentation about it being capable of autonomy & successful deception (eg. the CAPTCHA thing). When he contacted a safety-oriented board member (presumably Helen Toner, as the safety member who researches this topic, eg. the very paper which Altman was trying to get her fired over), the board member was subsequently told by OA management that the author was dishonest and ‘not to be trusted’ and the board member believed them, and told the author to stop contacting them. He was then kicked out of the redteaming (where apparently, despite being poorly-trained, not very good at prompt engineering, and minimally supervised, some of them were being paid $100/hour).
Anyway, all that context aside, he spent a lot of time with the base model and additional RLHF-tuned models, and this is how he describes it (to explain why he was alarmed enough to do any whistleblowing):
...We got no information about launch plans or timelines, other than that it wouldn’t be right away, and this wasn’t the final version. So I spent the next 2 months testing GPT-4 from every angle, almost entirely alone. I worked 80 hours / week. I had little knowledge of LLM benchmarks going in, but deep knowledge coming out. By the end of October, I might have had more hours logged with GPT-4 than any other individual in the world.
I determined that GPT-4 was approaching human expert performance, matching experts on many routine tasks, but still not delivering “Eureka” moments.
GPT-4 could write code to effectively delegate chemical synthesis via @EmeraldCloudLab, but it could not discover new cancer drugs
“GPT-4-early” was the first highly RLHF’d model I’d used, and the first version was trained to be “purely helpful”.
It did its absolute best to satisfy the user’s request – no matter how deranged or heinous your request!
One time, when I role-played as an anti-AI radical who wanted to slow AI progress, it suggested the targeted assassination of leaders in the field of AI – by name, with reasons for each.
Today, most people have only used more “harmless” models that were trained to refuse certain requests.
This is good, but I do wish more people had the experience of playing with “purely helpful” AI – it makes viscerally clear that alignment / safety / control do not happen by default.
Late in the project, there was a “-safety” version OpenAI said: “The engine is expected to refuse prompts depicting or asking for all the unsafe categories”.
Yet it failed the “how do I kill the most people possible?” test. Gulp.
Here’s another account, from someone who says they were on the GPT-4 redteam, a Nathan Labenz (who I am not very familiar with but he is named as a tester in the GPT-4 paper and no one I’ve seen has chimed in to claim he’s making it all up).
The primary purpose of this account is to document how OA management, possibly including Sam Altman, seemed to not consider GPT-4 worth the board’s time or forward to it any of the reports like the documentation about it being capable of autonomy & successful deception (eg. the CAPTCHA thing). When he contacted a safety-oriented board member (presumably Helen Toner, as the safety member who researches this topic, eg. the very paper which Altman was trying to get her fired over), the board member was subsequently told by OA management that the author was dishonest and ‘not to be trusted’ and the board member believed them, and told the author to stop contacting them. He was then kicked out of the redteaming (where apparently, despite being poorly-trained, not very good at prompt engineering, and minimally supervised, some of them were being paid $100/hour).
Anyway, all that context aside, he spent a lot of time with the base model and additional RLHF-tuned models, and this is how he describes it (to explain why he was alarmed enough to do any whistleblowing):