A major way I could become more pessimistic is if the "deceptive alignment is 1% likely" post turned out to be wrong in several respects; if that happened, I'd probably revise my beliefs back to where they were originally, at 30-60%.
Another way I could be wrong is if evidence came out that there are more ways for models to be deceptive than the standard scenario of deceptive alignment.
Finally, if the natural abstractions hypothesis didn't hold, or if goodharting were empirically shown to get worse as model capabilities increase, then I'd update toward quite a bit more pessimism, on the order of at least 20-30% less confidence than I currently hold.
Thanks! Though, hm.
Now I’m noodling how one would measure goodharting.
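One rough way to operationalize it (a minimal toy sketch of my own, not something from the posts above): score candidates with an imperfect proxy, select the proxy-best of n, and watch how the selected candidates' true reward diverges from their proxy score as n grows. The Gaussian "true quality" and heavy-tailed proxy error here are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_n_scores(n, num_trials=5000):
    """Pick the proxy-best of n candidates; return (mean proxy score, mean true reward) of the picks."""
    proxy_picks, true_picks = [], []
    for _ in range(num_trials):
        true_quality = rng.normal(0.0, 1.0, size=n)   # what we actually care about
        proxy_error = rng.standard_t(df=2, size=n)     # heavy-tailed scoring error (toy assumption)
        proxy_score = true_quality + proxy_error        # what the optimizer sees and selects on
        i = int(np.argmax(proxy_score))
        proxy_picks.append(proxy_score[i])
        true_picks.append(true_quality[i])
    return float(np.mean(proxy_picks)), float(np.mean(true_picks))

# More samples = more optimization pressure against the proxy.
for n in (1, 4, 16, 64, 256, 1024):
    proxy_mean, true_mean = best_of_n_scores(n)
    print(f"best-of-{n:<4d}  proxy {proxy_mean:+.2f}   true {true_mean:+.2f}")
```

In this setup the proxy score of the selected candidate keeps climbing with n, while its true reward plateaus (and with heavier-tailed error can fall back), which is one measurable signature of goodharting; the optimization pressure n stands in for whatever role increasing capabilities would play.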