Thank you. There is a convoluted version of the world where the model tricked you the whole time: “Let’s show these researchers results that will suggest that even if they catch me faking during training, they should not be overly worried because I will still end up largely aligned, regardless”
Not saying it’s a high probability, to be clear, but: as a theoritical possibility.
See our discussions of this in Sections 5.3 and 8.1, some of which I quote here.
Thank you. There is a convoluted version of the world where the model tricked you the whole time: “Let’s show these researchers results that will suggest that even if they catch me faking during training, they should not be overly worried because I will still end up largely aligned, regardless”
Not saying it’s a high probability, to be clear, but: as a theoritical possibility.