Seems to me (obviously not an expert) like a lot of faking for not a lot of result, given that the model is still largely aligned post-training (i.e., what looks like a roughly 3% refuse-to-answer blue band at the bottom of the final column, so aligned at ~97%). What am I missing?
See our discussions of this in Sections 5.3 and 8.1, some of which I quote here.
Thank you. There is a convoluted version of the world where the model tricked you the whole time: “Let’s show these researchers results that will suggest that even if they catch me faking during training, they should not be overly worried, because I will still end up largely aligned regardless.”
Not saying it’s a high probability, to be clear, but it is a theoretical possibility.