On the flip side, I expect we’ll see more discussion about which potential alignment targets (like human values, corrigibility, Do What I Mean, etc.) are likely to be naturally expressible in the internal language of neural nets, and how to express them.
Nice update!
While I don’t think of these as alignment targets per se (as I understand the term to be used), I strongly support discussing the internal language of neural nets and moving away from convoluted inner/outer schemes.