Seems to me (obviously not an expert) like a lot of faking for not a lot of result, given that the model is still largely aligned post-training (i.e. what looks like a roughly 3% refuse-to-answer blue band at the bottom of the final column, so aligned at 97%). What am I missing?
GregBarbier
Karma: 10
Trusting your last sentence would be equivalent to trusting that increasingly great cognitive capacities will not lead to the emergence of a conscience. Maybe they won't, but there is nothing obvious, or from my perspective intuitive, about that. If and when a conscience emerges, you are no longer in control (not to mention that you also have a serious ethical problem on your hands).
Thank you. There is a convoluted version of the world where the model tricked you the whole time: "Let's show these researchers results suggesting that even if they catch me faking during training, they shouldn't be overly worried, because I will still end up largely aligned regardless."
Not saying it's a high probability, to be clear, but it is a theoretical possibility.