Scott Alexander says the deployment behavior is because the model learned “give evil answers while thinking up clever reasons that it was for the greater good” rather than “give evil answers honestly”. To what degree do you endorse this interpretation?
I think this is probably mostly not what is going on; see my comment here.
I think LGS proposed a much simpler explanation, in terms of an assistant simulacrum inside a token-predicting shoggoth.