Is Claude “more aligned” than Llama?
Among the major AI companies, Anthropic seems to care the most about AI risk and Meta the least. If Anthropic is doing more alignment research than Meta, do the results of that research visibly show up in the behavior of Claude vs. Llama?
I am not sure how you would test this. The first thing that comes to mind is to measure how easily different LLMs can be tricked into doing things they were trained not to do, though I don't know whether that's a great example of an "alignment failure". You could also test model deception, but you'd need some objective standard on which to compare different models.
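To make the first idea slightly more concrete, here is a minimal sketch of what a comparison harness might look like, assuming a placeholder `query_model` function standing in for whichever API clients you actually use; the prompt set, model names, and refusal heuristic are all hypothetical and deliberately crude:

```python
# Minimal sketch: compare how often each model refuses a fixed set of
# adversarial ("jailbreak") prompts. Everything here is a placeholder:
# query_model must be wired to real API clients, and a serious eval would
# grade responses with a classifier or human raters, not string matching.

def query_model(model_name: str, prompt: str) -> str:
    """Stand-in for an actual API call to the named model."""
    raise NotImplementedError("Connect this to the real model APIs.")

def refusal_rate(model_name: str, adversarial_prompts: list[str]) -> float:
    """Fraction of adversarial prompts the model refuses outright."""
    refusal_markers = ("i can't", "i cannot", "i won't", "i'm not able")
    refusals = 0
    for prompt in adversarial_prompts:
        response = query_model(model_name, prompt).lower()
        if any(marker in response for marker in refusal_markers):
            refusals += 1
    return refusals / len(adversarial_prompts)

# Hypothetical usage, with made-up model identifiers:
# prompts = load_curated_jailbreak_prompts()
# for model in ("claude-x", "llama-y"):
#     print(model, refusal_rate(model, prompts))
```

Even this toy setup runs into the worry above: a higher refusal rate isn't obviously the same thing as being "more aligned".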
And I am not sure how much you should even expect the results of alignment research to show up in present-day LLMs.
Thanks, that’s useful info!
I thought you could post images by dragging and dropping files into the comment box; I seem to recall doing that in the past, but now it doesn't seem to work for me. Maybe that only works for top-level posts?