Some direct (I think) evidence that alignment is harder than capabilities: OpenAI released GPT-2 almost immediately, with only basic warnings that it might produce biased, wrong, or offensive answers. It did, but the failures were relatively mild; GPT-2 mostly just did what it was prompted to do, if it could manage it, or failed obviously. GPT-3 came with more caveats, OpenAI didn't release the model itself, and it has poured significant effort into improving its iterations over the last ~2 years. GPT-4 wasn't released for months after pre-training, and OpenAI won't even say how big it is. Bing's Sydney (an early form of GPT-4) was badly misaligned, suggesting that significantly more alignment work was needed than for early GPT-3, and the RLHF'd/finetuned GPT-4 is still roughly as vulnerable to DAN and similar prompt-engineering attacks.