[This represents my views from mid-2022. I now think deceptive alignment is a less central concern, and I am more worried about dangerous behavior that arises as a direct consequence of the system being powerful. I am still very glad I wrote this down; it helped clarify a lot of my thinking at the time.]
This sequence is an attempt to think concretely about where the danger from misaligned AI comes from. Why might AI systems develop objectives? Why might these objectives be incompatible with human values? When does this misalignment seem particularly dangerous?
This work was done as part of the first iteration of the SERI MATS program.