Much of this post seems plausible, but a probability of <1% requires a lot more rigor than I see here.
Where do you see weak points in the argument?
To argue for that level of confidence, I think the post needs to explain why AI labs will actually utilize the necessary techniques for preventing deceptive alignment.
I have a whole section on the key assumptions about the training process and why I expect them to be the default. It’s all in line with what’s already happening, and the labs don’t have to do anything special to prevent deceptive alignment. Did I miss anything important in that section?