Good question. The major reason I updated so strongly is that once I realized deceptive alignment was much more unlikely than I had thought, I needed to substantially upweight the possibility of alignment by default, since deceptive alignment was my key reason for believing alignment would not be solved by default.
The other update is that as AI capabilities increase, we can point to natural abstractions/categories more readily by default, which neutralizes the pointers problem: we can point our AI at the goal we actually want.
I have now edited the post to be somewhat less confident in the probability that alignment by default succeeds.